© Cloudera, Inc. All rights reserved.

A New Trend in Data Analysis: Democratization

DataRobot / Woonpyo Hong, Managing Director

Upload: buihuong

Post on 07-Mar-2019


INTRODUCTION

DATA SCIENCE WAVES

?

Source : https://blog.exploratory.io/data-science-by-you-dawn-of-third-wave-e89f2999d994

GARTNER : DEMOCRATIZED BY AUGMENTED ML

Democratized AI will be one of the major trends shaping future technologies.

DEMOCRATIZATION ALREADY?

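38,000+ people

High school students now solve ML problems at the level researchers were at three years ago…
Source: https://www.youtube.com/watch?v=ZZXnecufXPU

Deep learning development cost < the price of a pair of shoes: Caffe installation 10 yuan, 5 yuan per CNN layer, 8 yuan per RNN layer

Source: Zhongguancun, China (China's Silicon Valley)

23,000+ people

AI KOREA (Deep Learning)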

ACADEMY & OSG : AUTOMATIC MACHINE LEARNING

Reference: Efficient and Robust Automated Machine Learning, Feurer et al., Advances in Neural Information Processing Systems 28 (NIPS 2015).

Coding style similar to scikit-learn, which is widely used in the data scientist community

Searches the parameter search space automatically
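
To make the idea of an automatically searched parameter space concrete, here is a minimal sketch in plain Python. It is not the API of the system described by Feurer et al. (auto-sklearn), which relies on Bayesian optimization and meta-learning; plain random search is swapped in as a simpler stand-in. The search space, parameter names, and the toy `validation_error` function are all hypothetical.

```python
import random

# Hypothetical search space: names and ranges are illustrative only.
SEARCH_SPACE = {
    "learning_rate": (0.001, 0.3),  # continuous range
    "max_depth": (2, 10),           # integer range (inclusive)
}


def sample_config(rng):
    """Draw one random configuration from SEARCH_SPACE."""
    lr_lo, lr_hi = SEARCH_SPACE["learning_rate"]
    d_lo, d_hi = SEARCH_SPACE["max_depth"]
    return {
        "learning_rate": rng.uniform(lr_lo, lr_hi),
        "max_depth": rng.randint(d_lo, d_hi),
    }


def validation_error(config):
    """Toy stand-in for 'train a model and measure validation error'.

    By construction the minimum is at learning_rate=0.1, max_depth=6.
    """
    return (config["learning_rate"] - 0.1) ** 2 + 0.01 * (config["max_depth"] - 6) ** 2


def random_search(n_trials, seed=0):
    """Try n_trials random configurations and keep the best one."""
    rng = random.Random(seed)
    best_config, best_error = None, float("inf")
    for _ in range(n_trials):
        config = sample_config(rng)
        error = validation_error(config)
        if error < best_error:
            best_config, best_error = config, error
    return best_config, best_error


best_config, best_error = random_search(n_trials=200)
print(best_config, best_error)
```

In a real AutoML tool the toy error function would be replaced by actual model training and validation, and the sampler by a smarter search strategy.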

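Wraps algorithms whose interfaces differ from one CRAN package to another

Includes a wide variety of algorithms (>160)

(Semi-)automated, yet enables far more efficient analysis work than before:
• Pre-processing (missing values, transformations, etc.) and post-processing
• Hyper-parameter tuning
• Data observed during modeling, such as learning curves

Reference: https://mlr-org.github.io/mlr-tutorial/release/html/task/index.html

Reference: efficient neural architecture search (https://arxiv.org/abs/1806.10282)

[Network morphing process] [Performance of the automated NN]

Auto-keras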

ENTERPRISE NEEDS : SCALING DATA SCIENCE

The small pool of data scientists and the large amount of time needed to research, construct, and deploy models leave many businesses unable to quickly deliver time-sensitive projects.

[Chart: Predictive Algorithm Demand and Supply of Internal Resources plotted over Time; the gap between them is the Unmet Demand for Data Science]

HIGH COSTS

HIGH TURNOVER

SLOW, COSTLY INTEGRATION

FEWER INSIGHTS

DATA SCIENTIST NEEDS : AGONIES

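CRISP-DM methodology

① Iterations
o Endless repetition
o Stopping criteria

★ Accuracy vs. explainability
o Explanations that business users can understand
o The more complex the model, the more accurate it is

★ Open-ended
o Acquiring additional data
o Which data to acquire

② Comprehensive
o All analysis models (<4)
o Param. tuning

③ Re-training
o Growing error
o Data changes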
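
The re-training item above (growing error as data changes) can be illustrated with a small monitoring sketch. This is only one possible approach, not a method taken from the slides: it tracks a rolling mean absolute error and flags retraining once that error exceeds a tolerance relative to a baseline. The class name, the baseline error, the tolerance factor, and the window size are all hypothetical.

```python
from collections import deque


class RetrainMonitor:
    """Flags retraining when the rolling mean absolute error grows
    beyond tolerance * baseline_error (all values are hypothetical)."""

    def __init__(self, baseline_error, tolerance=1.5, window=50):
        self.baseline_error = baseline_error
        self.tolerance = tolerance
        self.errors = deque(maxlen=window)  # keeps only the latest errors

    def observe(self, y_true, y_pred):
        """Record the absolute error of one prediction."""
        self.errors.append(abs(y_true - y_pred))

    def rolling_error(self):
        """Mean absolute error over the current window (0.0 if empty)."""
        return sum(self.errors) / len(self.errors) if self.errors else 0.0

    def needs_retraining(self):
        """Decide only once the window is full, to avoid noisy early alarms."""
        if len(self.errors) < self.errors.maxlen:
            return False
        return self.rolling_error() > self.tolerance * self.baseline_error


monitor = RetrainMonitor(baseline_error=1.0, tolerance=1.5, window=5)

for _ in range(5):
    monitor.observe(y_true=10.0, y_pred=9.0)  # absolute error 1.0
stable = monitor.needs_retraining()

for _ in range(5):
    monitor.observe(y_true=10.0, y_pred=7.0)  # absolute error 3.0
drifted = monitor.needs_retraining()

print(stable, drifted)
```

A fixed window keeps the check cheap and makes it react to recent data only; the right window and tolerance depend on how quickly the underlying data shifts.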

WHY DATAROBOT FOR DEMOCRATIZATION

Confidential | Copyright © DataRobot, Inc. | All Rights Reserved

“DataRobot is a machine learning platform for analysts and data scientists to build and deploy accurate predictive models in a fraction of the time it used to take.”

DATAROBOT'S ANSWER

Making data scientists more productive and expanding the areas where AI is applied within the enterprise

[Diagram: Hacking Skills, Math & Stats, Domain Expertise]

Do much more with little to no coding + Expanded modeling toolkit

The world’s most advanced Automated Machine Learning platform

Founded 2012, HQ in Boston, MA
$220M+ in funding
200+ data scientists & engineers (of 450+)
750,000,000+ models built on DataRobot Cloud
50+ top 3 finishes
4 #1 ranked data scientists
Industries: insurance & banking, healthcare, fintech, on-demand services, many more

Notable Custom Programs

3 of the top 5 US Banks

The largest US for-profit Healthcare System

The largest US Supermarket chain

The largest US Pharmacy chain

The world’s largest Retailer

The world’s largest Auto Manufacturer

3 of the top 5 global Reinsurers

2 of the world’s largest Biotech companies

Global Telecommunication companies

Major League Baseball teams

Federal & Public Sector

Customer Success Tuned for Enterprise Customers

DATAROBOT SOLUTION FEATURES (1/4)

Accumulated analytics knowledge and technology

The top ranked Data Scientists in the world. The best technologies in the world.

Jeremy Achin, CEO & Co-Founder: MASTER, Highest: 20th
Xavier Conort, Chief Data Scientist: MASTER, Highest: 1st
Tom de Godoy, CTO & Co-Founder: MASTER, Highest: 20th
Owen Zhang, Product Advisor: MASTER, Highest: 1st
Sergey Yurgenson, Data Scientist: MASTER, Highest: 1st
Amanda Schierz, Data Scientist: MASTER, Current: 1st Female, 1st in UK

DATAROBOT SOLUTION FEATURES (2/4)

Automated analysis: even business users can build and use predictive models

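DATAROBOT SOLUTION FEATURES (3/4)

Explainability: data-driven explanations are provided for each and every algorithm

[Feature Impact]
• How does the importance of each variable differ?
• Does the importance ranking match business knowledge?
• Are there any new insights?

[Feature Effect]
• How is each variable related to the target?
• Does the functional relationship reflect business knowledge?
• Are there any new insights?

[Prediction Explanation]
• On what basis is a prediction generated?
• Can the model's predicted values be trusted?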
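
The Feature Impact questions above ask how the importance of each variable differs. One common model-agnostic way to quantify this is permutation importance: shuffle one feature column and measure how much the model's error grows. The slides do not say how DataRobot computes Feature Impact, so the following is only an illustrative sketch; the toy model, the data, and all function names are hypothetical.

```python
import random


def mean_squared_error(model, rows, targets):
    """Average squared difference between predictions and targets."""
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)


def permutation_importance(model, rows, targets, n_features, seed=0):
    """For each feature, the increase in error after shuffling that column."""
    rng = random.Random(seed)
    baseline = mean_squared_error(model, rows, targets)
    importances = []
    for j in range(n_features):
        column = [r[j] for r in rows]
        rng.shuffle(column)  # break the link between feature j and the target
        shuffled = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
        importances.append(mean_squared_error(model, shuffled, targets) - baseline)
    return importances


def model(row):
    """Toy model: relies strongly on feature 0, weakly on 1, ignores 2."""
    return 3.0 * row[0] + 0.5 * row[1]


data_rng = random.Random(42)
rows = [(data_rng.random(), data_rng.random(), data_rng.random()) for _ in range(200)]
targets = [model(r) for r in rows]

importances = permutation_importance(model, rows, targets, n_features=3)
for name, value in zip(["feature_0", "feature_1", "feature_2"], importances):
    print(name, round(value, 4))
```

Because the score only needs predictions, the same procedure applies to any algorithm, which is what makes it convenient for comparing importance rankings against business knowledge.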

DATAROBOT SOLUTION FEATURES (4/4)

Integration via API

[Diagram: Application server, Prediction worker, Notebook, Web UI, Console; REST API, R/Python; Cloud, Hadoop]

The API supports a wide range of analysis-related tasks:
• Model Factory
• Automatic Model Refresh
• Model Diags & Viz
• Feature Engineering
• App. Integration

CLOUDERA & DATAROBOT

Data Science Workbench

CDSW allows data scientists easy and secure access to data and distributed processing provided by Cloudera Enterprise. This enables data scientists to develop models in Python, R, or Scala without having to worry about the details of Hadoop and Spark. The focus is on the coding data scientist; CDSW can also leverage libraries available in DataRobot.

DataRobot

DataRobot offers an automated machine learning platform that empowers users of all skill levels to make better predictions faster. The integration with DataRobot allows CDSW users to either build models manually in R and Python or use the machine learning automation in DataRobot, all from the same workbench. The focus is on both the business analyst and the data scientist.

CLOUDERA & DATAROBOT INTEGRATION

DataRobot has the highest level of integration with Cloudera

Cloudera Parcels: A few clicks to install DataRobot in Cloudera Manager
Cloudera CSDs: Can use all the functionality of Cloudera Manager (monitoring, resource management, …)
Kerberos / Sentry: Secured authentication
YARN: All the resources consumed by DataRobot are managed by YARN
Spark: DataRobot uses Spark for Hadoop scoring

USE CASES

Fault Detection

Heavy industry manufacturing

Challenge: Reducing the need for human inspection in processes that are difficult to control

Data: Grinding, hitting, etc.; especially effective in processes where physics modeling is difficult

Results: Accurate alerts when products are likely to have faults. The model is refreshed hourly and deployed immediately to reflect changes in machine settings.

“Extremely high accuracy and a highly automated process, only possible with DataRobot”
- SI vendor working to implement the system at a heavy industry manufacturer

Predictive Maintenance

Gas utility company

Challenge: Optimizing maintenance cost by predicting failing assets

Data: Age, construction materials, text description, location, power grid, previous repairs, etc.

Results: Allowed this energy company to predict most incidents that were unrelated to weather (chance)

The ability to predict failing assets reduces the need for human inspection

Sales Forecasting

International Retail

Challenge: Preventing opportunity loss while minimizing excess production

Data: Time series sales data for thousands of products

Results: Allowed forecasting of all products, not just a few

Accuracy over 80% for over 70% of products
As good as human expert prediction

Insurance Underwriting

GLOBAL REINSURANCE COMPANY

Identifying simple underwriting rules to segment patients
Claim cost savings by rejecting the riskiest patients
Identifying 10% of customers with 5x higher than average mortality risk

● Replaced inaccurate & hard-to-maintain medical expert rules
More accurate results achieved in 4 hours vs. 2 weeks; 85% vs. 64% (AUC)
Portfolio ROI = $10M per year
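
The underwriting results above are quoted as AUC values (85% vs. 64%). As a reminder of what that metric measures, here is a minimal sketch that computes AUC as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as half. The labels and scores are made-up toy values, not data from the use case.

```python
def roc_auc(labels, scores):
    """AUC via pairwise comparison of positive and negative scores."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(positives) * len(negatives))


# Toy example: 1 = high-risk case, 0 = low-risk case (hypothetical values).
labels = [1, 1, 1, 0, 0, 0]
good_scores = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1]
weak_scores = [0.6, 0.3, 0.4, 0.5, 0.7, 0.1]

auc_good = roc_auc(labels, good_scores)
auc_weak = roc_auc(labels, weak_scores)
print(auc_good, auc_weak)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so a move from 64% to 85% means the model ranks risky cases above safe ones far more reliably. The pairwise loop is quadratic and only meant for illustration; production code would use a sort-based implementation.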

Fraud Detection

Identifying claim fraud to support payments

Potentially very large reduction in claim costs
More accurate models built faster
REST API: faster, simpler deployment

“We’ve looked at just about every viable vendor in this space & we have not seen anyone do what DataRobot can do.”
- SVP of Technology Innovation

Customer Churn

Targeting customers likely not to renew the next contract

Potential $10M in additional revenue
Increased accuracy in targeting high churn risk customers
Better identification of customers who can be persuaded to stay
Faster data analysis

“We cannot find or pay for the data scientists necessary to accomplish our goals, but with DataRobot we can get there.”
- SVP of Data Analytics

LIVE DEMO

LIVE DEMO DATA

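Loan risk modeling

• Problem
o Based on the loan applicant's profile,
o build a model that predicts default risk to support optimized approve/reject decisions

• Data
o Loan information (amount requested, repayment term)
o Personal information (employer, annual salary, address, etc.)
o Past credit information (number of accounts, etc.)

[LeadingTree case]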

THANK YOU

Woonpyo Hong
Data Scientist, DataRobot
