Big Data, New Insights

TTA, Big Data Training, 2012. 10. 17 [email protected] Director, Big Data Strategy Research Center, National Information Society Agency (NIA)


  • TTA,

    2012. 10. 17

    [email protected]


  • DB DB DB

  • Calculating Database Online Ubiquitous

    ICT

    Intelligence

  • , SNS

    IT

    , ,

    (embedded system)

  • ()

    CIO ERP

    (Real Analytics)

    (Mobile First) IT, IT

    IT , IT ,

  • PC

    /

    EB(Exa Byte) (90 =100EB)

    ZB(Zetta Byte) (2011=1.8ZB)

    ZB (20=11 50 )

    (, )

    (, , SNS)

    , (RFID, Sensor, )

    , , ,

    2011 1.8ZB()

    1.8 = 1.8

    2020 50

    SNS Web2.0

    1 1PC

    www

    (IDC & EMC, Digital Universe Study 2011)

    IT everywhere

    * Byte, Kilo, Mega, Giga, Tera, Peta, Exa, Zetta

    1 ZB = 10^21 bytes = 1 trillion GB

    (, , 2012. 3)

  • (Big Data)'

    Volume Variety Velocity

    Complexity Value

  • ( )

    (Hadoop, NoSQL, R )

    ,

    (, ,

    , , GPS )

    (,

    )

    3V

    ++

    : (2012), , IT 3

  • 3

    (Big Data Platform)

    , (NoSQL, ETL..)

    (Hadoop, MapReduce..)

    ( , , ..)

    (Visualization)

    (Big Data)

    (Data Scientist)

    , (IT )

    , ,

  • (

    )

    ,

    Silos

    Sharing

    Aggregating

    Co-creating

  • EU

    (www.data.go.kr)

    , 'Data.gov

    65

    Data.gov

    (ODS: Open Data Strategy) (11. 12)

    EU 2013 pan-European

    2.0 (data.gov.au)

  • : (2011), Social Big Data & Collective Intelligence'

    :

    : :

  • : (2012), Big Data

  • (Hadoop)

    (HDFS),

    (MapReduce)

  • : KT

  • (Mathematics, Statistics..)

    (Engineering, Computer Sciences, Natural Sciences, Social Sciences)

    : Forbes, 'Amazon's John Rauser on "What Is a Data Scientist?"'(2011.10.7), , ..., (2012. 3. 18)

    6

  • : HARD Skill : SOFT Skill

    : , , , IT & Future Strategy, , 2012. 8.

  • Network World IT , , , ,

    , , ,

    Data Scientist

  • - Chief Economist, Hal R. Varian -

  • .

    : , , , 2012.6

  • (Hadoop)

    IT

    BI ,

    BI , , , ,

    : , , , 2012.6

  • (, , 2012. 3)

    ?

  • `

    DB KMS Web2.0

    < '' ' >

    --

    2011 2 (Jeopardy!)' IBM '(Watson)'

    ,

  • ,

    , ,

    (Huge Scale)

    (Reality)

    (Trend)

    (Combination)

    ,

    , ,

    ,

    (, , 2012. 3)

  • Economist

    (2010)

    ,

    Gartner

    (2011)

    21 ,

    (Information Silo)

    McKinsey

    (2011)

    , ,

    , 5 6

    ,

  • ,

    ,

    ,

    ,

    ,

    ,

    ,

    (, , 2012. 3)

  • : , , , 2012.6.

  • IT

  • google.com

    50

    () () 25,000

    1 7,000 1

    4 4

    20 100,000

    Gmail (SNS)

    845

    Google.org (, )

    OS OS

    G1 (Knol)

    236

    TV

    S 380 30

  • TV

    Google

    (, )

  • Data Strategy Board

    (BIS, 2012. 3) - - ,

    Open Data Strategy

    - , - ,

  • - 10 12~15 - Ad Hoc Group

    : Active Japan ICT , 39-3-2

  • , , ,

    , ,

    ,

    , SW

    , R&D

    , , ,

    7

  • : $3,300

    : 60%

    : McKinsey(2011)

    : 10

    : 12~15

    : (2012)

    : 10 7

    : (2011)

    : 160~330 ( 2.5~4.5%)

    : Policy

    Exchange(2012)

    EU

    : 2,500

    : McKinsey(2011)

  • , Big data,

  • , IT , (2012.4.23)

  • Calculating Database Online Ubiquitous

    ICT

    Intelligence

    Q: ?

    A: , , B: , , ,

  • 2012 IT IT

  • /

    /

    /

    /

    /

    /

    IT

  • :

  • (, & , Gov3.0 , 2012. 6)

  • 8 (71% ) : &

  • + GPS

  • IT!

  • /

    /

    / /

    /

  • 1. , , ,

    2. -> vs ->->

    3. :

    4. +++

    5.

  • 1. , , ,

    2. : , , , ,

    3. , &

    4. , ,

    5. : ;

  • www.bigdataforum.or.kr

  • : , (2011. 11. 7)

  • (Mathematics, Statistics..)

    (Engineering, Computer Sciences, Natural Sciences, Social Sciences)

    : Forbes, 'Amazon's John Rauser on "What Is a Data Scientist?"'(2011.10.7), , ..., (2012. 3. 18)

    !

  • [email protected]

  • 0/88 0

    ETRI Proprietary Electronics And Telecommunication Research Institute

  • 1/88

    -

  • 2/88

    ?

    : , ,

    Data Mining

    Text Mining

    Log Mining

    Bio/Medical Mining

    Stream Mining

  • 3/88

    21 :

    : 2011, 1.8ZB 2020, 35ZB (44 , 1ZB = 1GB)

    21 (Gartner, 2011)

    5%

    : Economist, Gartner, IDC, McKinsey, Nature Next Google

    21 Information silo

    Gartner (2011.03)

    / , / , 5 6

    McKinsey (2011.05)

    Big data: The next frontier for innovation, competition, and productivity

    SNS M2M , , ,

    Economist (2010.05)

  • 4/88

    1. Business application data (e.g., records, transactions)

    2. Human-generated content (e.g., social media)

    , , ,

    3. Machine data (e.g., RFID, Log Files etc.)

  • 5/88

  • 6/88

    21 (Gartner)

    : Risk Assessment Horizon Scanning

    : Evidence-driven decision support

    Value

    (//)

    Horizon Scanning Advanced Analytics Decision Support

  • 7/88

    ?

    5 : (US), (EU), LBS , , : McKinsey, 2011

  • 8/88

    ?

  • 9/88

  • 10/88

    ,

    -- : ,

    :

    :

  • 11/88

    () / () / , ()

    ()

    ,

    , ,

  • 12/88

    -

  • 13/88

    ,

    /

    //

    ?

  • 14/88

    Data Mining, Predictive Analytics

    Text Mining, Question Answering

    Opinion Mining, Social Media Analytics, Social Network Analytics, Predictive Analytics

    Log Data Mining

    Modelling & Simulation

  • 15/88

    (1) Data Mining

    (Association rule mining)

    Market basket analysis

    (Classification) : , Buying decision, churn rate, consumption rate

    (Regression) , ,

    (Cluster analysis) Segmenting customers into similar groups for targeted marketing

    (Novelty Detection) Fault detection, Fraud detection

    Red Ocean: SAP, IBM, SAS, Oracle, Microsoft

  • 16/88

    (2) vs.

    : ) /, /, /

    : ) , ,

  • 17/88

    : (Classification)

    (Class) , (Class) , ,

  • 18/88

    : (Regression)

    ,

    X

    Y

    X

    Y

    37

    ?? 33

  • 19/88

    Google Prediction API

    Google's cloud-based machine learning tools can help analyze your data to add the following features:

    Ford's Smart Car System

  • 20/88

    Predicting the Present with Google Trends

    Can Google queries help predict economic activity? Google Trends provides an index of the volume of Google queries by

    geographic location and category.

    Google classifies search queries into 27 categories at the top level and 241 categories at the second level.

    GNU R

  • 21/88

    Google

    10 (2009) ,

    20

  • 22/88

    Google

    Google

    Google 18

  • 23/88

    [] GNU R Programming Language

    R is an open source programming language and software environment

    for statistical computing and graphics.

    S

  • 24/88

    (3) Text Mining

    Goal: to turn text into data for analysis via application of natural language processing (NLP) and analytical methods.

    Text analysis involves information retrieval, lexical analysis to study word frequency distributions, pattern recognition, tagging/annotation,

    information extraction, data mining techniques including link and

    association analysis, visualization, and predictive analytics.

    , , , ,

    , , , , , 10

  • 25/88

    [] Apache UIMA

    UIMA Architecture Frameworks:

    support configuring and running pipelines of Annotator

    components

    Components (i.e., Annotators):

    do the actual work of analyzing the unstructured information

    Infrastructure:

    include a simple server that can receive requests and return

    annotation results, for use by

    other web services.

  • 26/88

    (4) Opinion Mining

    Opinion Mining or Sentiment Analysis

  • 27/88

    Opinion Mining

  • 28/88

    Opinion Mining

  • 29/88

    Opinion Mining

    Application of Sentiment Analysis Business Intelligence system

    Purchase planning

    Public opinion management

    Web advertising

  • 30/88

    Aspect-based Opinion Mining

    Aspect Identification Aspect Expression Extraction Aspect Expression Clustering Aspect Hierarchy Generation

    Value Expression Extraction {Aspect, Value} Relation Extraction

    Implicit Aspect Identification

    {Aspect, Value} Polarity Assignment

    30

    Terminology

    Aspect

    : { , , , , } Aspect Expression

    .: { , , } Value Expression ( value)

    .: { , , , }

  • 31/88

    Aspect Hierarchy Generation

    optimization approach

    Domain-Assisted Product Aspect Hierarchy Generation: Towards Hierarchical Organization of Unstructured Consumer Reviews [2011 EMNLP] 31

  • 32/88

    (5) Question Answering

    :

    :

    (Answer Engine) IT

    Life is about questions & answers.

    -> Decision making

  • 33/88

    IBM Watson QA

    Watson , ,

    Deep QA- ()

    SW

    -> 3 ( 2~6)

    (2.6GHz) 2

    -> 1(200 )

    Apache Hadoop

    Apache Lucene

    Apache UIMA(Unstructured Information Management Architecture)

    Deep QA -> 100

    33

  • 34/88

    IBM's Grand Challenges

    Chess -> Human Language

    SW (2) , Big data deep analytics Deep QA

    HW (1) IBM Power750 90(2,880 ) Deep blue 100 2010 Top 94 (80TFs)

    SW HW Deep Blue

  • 35/88

    Jeopardy! Questions

    < Game Board Category: US Cities> Hard Question

    Simple Question

  • 36/88

    Watson QA

    : 3 vs. 0.4

    Watson can never be sure of anything

    Question Difficulty

    Usability (, , )

    Content Language Difficulty

    Confidence

    Accuracy

    Speed

    Broad Domain

    Query Language Difficulty

  • 37/88

    Watson for Business Intelligence

    , , , Insight

  • 38/88

    IBM ?

    Do they accomplish human-like language processing? Paraphrase an input text

    Translate the text into another language

    Answer questions about the contents of the text

    Draw inferences from the text

    Turing test proposed by Alan Turing (1950)

    Watson has not met Turing's standard for true AI.

    It does not have the intelligence to understand the questions and the answers.

    However, Watson is certainly intelligence augmentation (IA) that extends the human brain.

    : IBM

  • 39/88

    Wolfram Alpha

    Wolfram Alpha supports Apple's Siri for factual question answering

    Siri now accounts for 25 percent of all searches made on Wolfram Alpha (NY Times, 2012.2.7)

  • 40/88

    Google Knowledge Graph

    Google's next frontier for search

  • 41/88

    (6) Log Data Mining: Personal Location Data

    Personal Location Data Mining

  • 42/88

    Log Data Mining: Web Log Data

    Google Insights ()

    Big data

  • 43/88

    (7) Social Network Analysis

  • 44/88

    (8)

    1. Predict Risk

    2. Predict Market

    3. Predict Popularity

    4. Predict Mood

    5. Predict Social Dynamics

  • 45/88

    Predict Risk

    , , Natural Risk (storms, fires, traffic jams, riots, earthquakes, etc.)

    (249) Earthquake Shakes Twitter User: Analyzing Tweets for Real-Time Event Detection, IW3C2, 2010

    (88) Microblogging during two natural hazards events: what twitter may contribute to situational awareness, CHI, 2010

    Financial Risk

    (27) Predicting risk from financial reports with regression, NAACL, 2009

    (2) Hunting for the black swan: risk mining from text, ACL, 2010

  • 46/88

    Predict Market

    , , (Wisdom of crowds) Social Media, News PM

    (9) Predicting Movie Success and Academy Awards Through Sentiment and Social Network Analysis, 2008, ECIS

    (124) Predicting the future with social media, 2010

    (5) Using Social Media to Predict Future Events with Agent-Based Markets, 2010, IEEE

    (130) Twitter mood predicts the stock market, 2010, Journal of Computational Science

    Predicting Financial Markets: Comparing Survey, News, Twitter and Search Engine Data, 2011

    (16) Reading the Markets: Forecasting Public Opinion of Political Candidates by News Analysis, 2008, Coling

    (106) Predicting Elections with Twitter: What 140 Characters Reveal about Political Sentiment, AAAI, 2010

  • 47/88

    Predict Popularity

    social connection, link structure, user behavior pattern ( )

    Digg, Youtube (22) Digging Digg : Comment Mining, Popularity Prediction, and Social Network

    Analysis, IEEE, 2009 Dig ( , , ) digg-score

    (111) Predicting the Popularity of Online Content, ACM, 2010

    (Digg: 1 , Youtube: 7 ) 30

    Forum.myspace.com, Forum.dpreview.com (9) An Approach to Model and Predict the Popularity of Online Contents with

    Explanatory Factors

    France News sites (2) Predicting the popularity of online articles based on user comments, ACM,

    2011

    Twitter (23) Trends in Social Media - Persistence and Decay, AAAI, 2011

    - , , , 2012

  • 48/88

    Predict Mood

    Sentiment ,

    Global mood phenomena: ( )

    Public mood

    Mood modeling

    (80) Capturing Global Mood Levels using Blog Posts, 2006, AAAI

    (66) Modeling Public Mood and Emotion-twitter sentiment and socio-economic phenomena, 2009, AAAI

    (1) Effects of the recession on public mood in the UK, 2012, WWW MSDN workshop

  • 49/88

    Predict Social Dynamics

    Unemployment through the Lens of Social Media : ,

    (2009.6.~2011.6)

    : ,

    : Un , SAS

    40 5 , 6 90%

  • 50/88

    Recorded Future: Temporal Analytics Engine

    Event Entity Time

    CIA 2008

    () (, ) /

    () (: , ) .

    () ,

  • 51/88

    (Ushahidi)

    Ushahidi: , /

    2007, ,

    a tool to easily crowdsource information using multiple channels, including SMS, email, Twitter and the web.

    , ,

    ++

    ,

    ,

    51

  • 52/88

    (9) Modelling & Simulation

    RAHS

    - RAHS(Risk Assessment & Horizon Scanning)

    - ,

    - 11 RAHS 2.0

    9.11

    ,

  • 53/88

    -

  • 54/88

    -/

    - ? ,

    , , , (1012) (SERI, 2010)

    , , , ,

  • 55/88

    - ? /,

    Insight

    : , , , ,

    : /

    ()

    ?

    () S2 ?

    (++ + )

  • 56/88

    -, ,

    -

    -/ ( )

    //

  • 57/88

    ,

    () 6

    .

    (, , )

    /nc /nc+/xsn+/jc /nc+/nc+/jj /nc+/jc /nc+gk/Xsv+/ec /pv+ /ep+/ef ./s

    +/xsn+/jc+/jj +/jc /nc+gk/Xsv+/ec /pv+ /ep+/ef ./s

    Verb():Arg1( ), Arg2( )

    /

    Entity: Object: , Value:

    : (-9.5)

  • 58/88

    -

    , Insight /

    : , ,

    1.

    2.

    3.

  • 59/88

    , (Evidence-driven)

    :

    :

    :

  • 60/88

    Insight Delivery

    Issue Predictive Analytics

    Knowledge Analysis

    Information Analysis

    Data Sensing

    /

    /

    /

    /

    -

    SNS

    / / /

    /

  • 61/88

    1 2 (12/9 )

    98 187

    39 67

    39 92

    43 99

    /

    //

    Hadoop HBase

    (Crawling API, Streaming API)

  • 62/88

    , , ,

    :

    Follower, Mention, Retweet PageRank ,

    /

    (SVM)

  • 63/88

    ,

    ,

    //

    , ,

    (B)

    Depth Retweet

    (/)

    Nested

    network

    Depth Retweet

    (/)

    Nested

    network

    (A)

  • 64/88

    /

    -

    - //

    - ()

    -

    - /// /

    - ////

    ()

  • 65/88

    /

    (, SNS )

    ( )

    (2)

  • 66/88

    /

    /

    /

    /

    /

    /

    ,

    Transition-based parsing hash kernel , ( O(n^3) O(n): 8 ) Deterministic parser beam search

    180 () 4 () Structural SVM

    (2)

  • 67/88

    -/, -, -

    /

    SRL

    ,

    / ,

    / * SRL: Semantic Role Labeling

    XX

    S2

    .

    (2)

  • 68/88

    [// /()/()/]

    Holder

    Target

    Aspect

    Time

    Sentiment

    Trigger:

    Anchor:

  • 69/88

    [] Theory of emotion

    () () ()

    ()

    () () ()

    () () ()

    () ()

    () ()

    () ()

    () ()

    [Plutchik's wheel of emotions: eight primary emotions] [ ]

  • 70/88

    17

    /// /

    /

    Trigger

    Sentiment Shifter(, )

    NEGATIVE POSITIVE NEUTRAL

  • 71/88

    /

    Sentiment Shifter(, )

  • 72/88

  • 73/88

  • 74/88

    (Seed)

    ?

    ?

    :

    :

    (, )

  • 4.11

    3

    1

    5.16

    : 2012 1-8 : 314,648,676 : 26,438,236(8.4%)

    (8/11), . (7/31),

    (4/5) 4.11(4/11). 3(5/24)

    /

    3 4 5 6 7 8

  • 76/88

    / /// /

    Competitive Intelligence

  • 77/88

    []

  • 78/88

    -

    , Insight /

    : , ,

    0.0000

    0.2000

    0.4000

    0.6000

    0.8000

    1.0000

    1.2000

    1 2 3 4 5 6 7 8

    11

    :

    (/ )

    ,

    : 46,768

  • 79/88

    Novelty(h1): ? discrepancy score

    Importance(h2): ? term

    Strength(h3): ? //

    Confidence(h4): ? source

    Interestedness(h5): ? , , RT

  • 80/88

    []

    12/22: A

    11/23:

    12/30: A

    A vs

    [A ]

    [ETRI-WISDOM]

  • 81/88

    , (Evidence-driven)

    :

    :

    :

    ()

  • 82/88

    -

    / /

    SNS

    ARIMA: Autoregressive Integrated Moving Average

    ECM: Error Correction Model

    (ARIMA, ECM )

    (, ) DB

    (, )

    -: / -: /

    -: -: /

  • 83/88

    (1/6)

  • 84/88

    vs.

    ( )

    /

    (, , )

    /

    /

  • 85/88

    -

  • 86/88 86

    , - SNS , , , / Reasoning, ,

    ,

    SW SW 2 10% (SERI, 2010)

    Data-driven Insight / , ,

  • 87/88

    [] 5 Big Data Questions For CEOs

    1. How is big data going to help my business?

    2. How much will it cost?

    3. How risky is it?

    4. How will we measure the return?

    5. How long will it take to see results?

    : http://www.forbes.com/sites/ciocentral/2012/06/26/5-big-data-questions-for-ceos/

  • 88/88

    . Q&A

  • Big Data

    Hadoop

    Edward KIM

    [email protected]

  • (JCO) 6 ( )

    JBoss User Group

    Architect

    Hadoop Java EE

    Open Flamingo (http://www.openflamingo.org)

    Java Application Performance Tuning

    IT

    JBoss Application Server5, EJB 2/3

    Oreilly RESTful Java

    2

  • 3

  • ?

    4

    Insight, Context, Data Scientist

    Early Adopter Collector .

  • ?

    5

    10G? 50G? 100G?

    1T? 10T? 50T? 100T?

    1P ?

    10

    100 bytes × 6 (per minute) × 60 (per hour) × 24 (per day) × 6,000,000

    = 864,000 × 6,000,000 = 5,184,000,000,000 bytes

    = 4,943,847 MB = 4,827 GB (per day)
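The sizing arithmetic on this slide can be checked with a short sketch. It assumes the reading suggested by the slide's own intermediate line (864,000 × 6,000,000): 100-byte records, 6 records per minute, and 6,000,000 sources. The names `daily_bytes`, `BYTES_PER_RECORD`, `RECORDS_PER_MINUTE`, and `SOURCES` are hypothetical and chosen only for illustration.

```python
# Assumed inputs, read off the slide's arithmetic (not stated as variables there).
BYTES_PER_RECORD = 100
RECORDS_PER_MINUTE = 6
SOURCES = 6_000_000


def daily_bytes(bytes_per_record: int, records_per_minute: int, sources: int) -> int:
    """Total bytes produced in one day by all sources."""
    per_source_per_day = bytes_per_record * records_per_minute * 60 * 24
    return per_source_per_day * sources


total = daily_bytes(BYTES_PER_RECORD, RECORDS_PER_MINUTE, SOURCES)
total_mb = total // 1024**2  # whole mebibytes
total_gb = total // 1024**3  # whole gibibytes

print(f"{total:,} bytes = {total_mb:,} MB = {total_gb:,} GB per day")
```

With binary units (1 MB = 1024² bytes) the result is roughly 4.7 TB per day, which is why a single-server design runs out of room quickly.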

  • Big Data

    6

    +++

    H/W + S/W

    DevOps

  • Big Data ?

    7

  • Big Data

    8

    Platform

    Service

  • Big Data OpenSource

    9

    Big Data

  • ?

    10

    IT

  • ?

    11

  • Apache Hadoop

    File System : HDFS(Hadoop Distributed File System)

    64M

    2003 Google Google File System

    (MapReduce) (2004 Google )

    HDFS

    Parallelization, Distribution, Fault-Tolerance

    12
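As a rough illustration of how HDFS lays out a file, the sketch below counts the blocks a file occupies at the 64 MB block size shown on the slide. The replication factor of 3 is an assumption (it is the usual HDFS default, not a value given on the slide), and the function names are hypothetical.

```python
import math

BLOCK_SIZE_MB = 64   # block size shown on the slide
REPLICATION = 3      # assumption: the common HDFS default, not stated on the slide


def hdfs_block_count(file_size_mb: float, block_size_mb: int = BLOCK_SIZE_MB) -> int:
    """Number of blocks a file of the given size is split into."""
    return math.ceil(file_size_mb / block_size_mb)


def raw_storage_mb(file_size_mb: float, replication: int = REPLICATION) -> float:
    """Raw cluster capacity consumed once every block is replicated."""
    return file_size_mb * replication


print(hdfs_block_count(1024), raw_storage_mb(1024))
```

Each block can be processed by a separate map task, which is where the parallelization and fault tolerance listed on the slide come from.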

  • Hadoop

    13

    !

    ) MapReduce Sorting Sorting

    Local Sorting Out Of Memory

  • Apache Hadoop Architecture

    14 Manning, Hadoop in Practice

  • Apache Hadoop ?

    / .

    .

    I/O CPU .

    .

    linear .

    linear .

    .

    Apache Hadoop .

    Intel Core .

    15

  • Hadoop, RDBMS

    16

    Big Data

    .

  • Hadoop

    17

  • Hadoop

    ETL(Extract, Transform, Load)

    Data Warehouse

    Storage for Log Aggregator

    Distributed Data Storage (; CDN)

    Spam Filtering

    Bioinformatics

    Online Content Optimization

    Parallel Image, Movie Clip Processing

    Machine Learning

    Science

    Search Engine

    18

  • Apache Hadoop

    19

  • Apache Hadoop

    20

  • Apache Hadoop

    21

  • Apache Hadoop

    22

  • Hadoop Cluster

    2 CPU(4 Core Per CPU) Xeons 2.5GHz

    4x1TB SATA

    16G RAM

    1G

    10G

    20

    Ubuntu Linux Server 10.04 64bit

    Sun Java SDK 1.6.0_23

    Apache Hadoop 0.20.2

    23

    3~4

    - HDD Crash

    - Kernel Crash

    - LAN Fail

  • Big Data Appliance Hardware

    18 Sun X4270 M2 Servers

    48 GB memory per node = 864 GB memory

    12 Intel cores per node = 216 cores

    36 TB storage per node = 648 TB storage

    40 Gb p/sec InfiniBand

    10 Gb p/sec Ethernet

    24

    Processors 2 Six-Core Intel Xeon X5675 Processors (3.06 GHz)

    Memory 48GB (6 * 8GB) expandable to 96 GB or 144

    Disks 12 x 3 TB 7.2K RPM High Capacity SAS (hot-swap)

    Disk Controller Disk Controller HBA with 512MB Battery Backed Cache

    Network 2 InfiniBand 4X QDR (40Gb/s) Ports (1 Dual-port PCIe 2.0 HCA)

    4 Embedded Gigabit Ethernet Ports

  • Hadoop Ecosystem

    25

  • Hadoop

    26

    Hadoop . Google Compute Engine

    !!

  • Hadoop

    27

    Database

    Hadoop

    Analytics

    Hadoop

    New

    Service

    &

    Platform

    Architecture

    Integration

    Performance

    Cost

    Development

    Data

    Analytics

    Practices

    Focus Issue Project

  • SK Telecom Hadoop

    28

    AS-IS Oracle RAC Database Big Data (100 Tera Bytes)

    3 Layer(Sub System)

    Service Adaptation Layer(SAL)

    KD CL

    Open API XML

    Collection Layer(CL)

    ETL,

    Knowledge Discovery(KD)

    (; K-Means)

    Big Data Analytics, Data Scientist

    ,

    TO-BE Apache Hadoop

    KD, CL Hadoop Migration

    , , ,

  • SK Telecom Hadoop

    29

    Big Data Platform

    Apache Hadoop, Pig, Hive

    Workflow Engine & Designer, HDFS Browser

    MapReduce based Mining Algorithm, ETL

    AR, CF, K-Means,

    Service Platform

    Melon :: Association Rule

    T store, AppMercer :: CF, Cold Start, Association Rule

    Hoppin :: Real-Time Mining, CF, Cold Start

    NATE

    Vingo

    Ad Platform

    100 segmentation

    .

  • SK Telecom Hadoop

    30

  • SK Telecom Hadoop

    31

  • SK Telecom Hadoop

    32

    / Best, Best

    T store 20 , 0.05%

    14%

    Apple App Store 1000

    1.76%

    Android Market Top 50 60%

    ,

    Top 10 (Cold Start)

  • SK Telecom Hadoop

    T store

    Collaborative Filtering

    Association Rule

    Cold Start

    AS-IS

    AS-IS

    TO-BE

    Hadoop

    33

  • SK Telecom Hadoop

    34

  • SK Telecom Hadoop

    35

    Melon

  • Melon

    36

  • 37

    SK Telecom Hadoop

    Oracle Hadoop

    CPU 100% 70%

    Core 80 Core Intel 8 Core * 20 = 160 Core

    1 34

    1 1

    120,000,000

    (T) 1,300,000

    6 High End Server

    300 * 20 = 6,000

    ) Core 700 * 80 = 56,000

    0

  • SK Telecom Hadoop

    Hoppin N

    38

  • SK Telecom Hadoop

    Hoppin

    Real-Time

    Action ) ,

    Collaborative Filtering, Cold Start

    , ,

    Text Mining

    ()

    39

  • SK Telecom Hadoop

    40

    - -

    User Preference

    Streaming - Data Grid -

    Implementation

    A

    B

    C

    D

    E

    Rock R&B K-POP J-POP Soul

    5 6 4 1 6 0

    Rock R&B K-POP J-POP Soul

    4 2 1 4 2 1

    Rock R&B K-POP J-POP Soul

    5 6 3 2 1 1

    Rock R&B K-POP J-POP Soul

    1 5 6 2 3 0

    User Preference

  • Real Time Big Data

    41

  • Use-Case: Dispenser

    42

  • Use-Case: Dispenser

    43

  • Facebook Real Time Analytics System

    44

  • Apple iOS6 Maps

    45

  • 46

    Big Data 4 3 Realtime Big Data

    Realtime & Big Data

    SI

    , , ,

    Big Data

    Big Data

    Big Data

  • 47

    1 (2004.04~) :: SW

    SW

    NEIS Linux

    SW

    2 (2009.04~) :: SW

    SW

    SW

    3 (2012.10~) :: SW , , SW

  • 48

    SW

    SW SW

    SW

    SW

    SW

    SW /

    SW

    SW R&D SW /

    SW

  • NIPA :: Architecture Reference Model

    49

    , , ,

    OpenSource

    , ,

    ,

    AS-IS, TO-BE Architecture

    : Hadoop, Pig, Hive, MongoDB, Slurper, Oozie, Sqoop, Storm, Flume, Ganglia, RHQ

    Big Data Slurper Collector

  • Hadoop Project

    50

    No Experience

    HW & SW tightly

    coupling

    Installation

    & Configuration

    Performance

    Tuning

    Provisioning

    Integration

    Trade Off

  • Apache Hadoop HDFS Architecture

    51 Manning, Hadoop in Practice

  • MapReduce Logical Architecture

    52

  • WordCount

    Hadoop MapReduce Framework

    ROW Word Word

    53

    Mapper Input: hadoop apache page hive hbase cluster hadoop page cloud copywrite

    Reduce Output: apache 1, cloud 1, cluster 1, copywrite 1, hadoop 2, hbase 1, hive 1, page 2
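The WordCount flow can be imitated in plain Python to show what the map, shuffle, and reduce steps each contribute. This is a single-process sketch, not Hadoop code: `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical names, and splitting the slide's input into two lines is an assumption made only to have more than one record.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit (word, 1) for every word in one input line."""
    for word in line.split():
        yield (word, 1)


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, counts):
    """Reducer: sum the counts collected for one word."""
    return (word, sum(counts))


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    # Sorting by key mirrors the sorted order of the reduce output on the slide.
    return sorted(reduce_phase(word, counts) for word, counts in grouped.items())


lines = [
    "hadoop apache page hive hbase",
    "cluster hadoop page cloud copywrite",
]
result = word_count(lines)
for word, count in result:
    print(word, count)
```

The printed pairs match the Reduce Output shown on the slide.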

  • WordCount

    54

  • Apache Pig

    = Pig Latin

    MapReduce

    Pig Latin MapReduce

    MapReduce

    Bag, Tuple,

    55

  • Pig Latin

    56

    -- max_temp.pig: Finds the maximum temperature by year

    records = LOAD 'input/ncdc/micro-tab/sample.txt'

    AS (year:chararray, temperature:int, quality:int);

    filtered_records = FILTER records BY temperature != 9999 AND

    (quality == 0 OR quality == 1 OR

    quality == 4 OR quality == 5 OR quality == 9);

    grouped_records = GROUP filtered_records BY year;

    max_temp = FOREACH grouped_records GENERATE group,

    MAX(filtered_records.temperature);

    DUMP max_temp;

    (1950,0,1)

    (1950,22,1)

    (1950,-11,1)

    (1949,111,1)

    (1949,111)

    (1950,22)

    (1949,{(1949,111,1),(1949,78,1)})

    (1950,{(1950,0,1),(1950,22,1),(1950,-11,1)})
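For readers unfamiliar with Pig Latin, the same filter, group, and max steps can be sketched in Python over the sample tuples shown above. This is an illustrative equivalent only; `max_temp_by_year` and `VALID_QUALITY` are hypothetical names, and the data is the five (year, temperature, quality) tuples from the grouped output.

```python
# Sample (year, temperature, quality) records taken from the slide's grouped output.
records = [
    ("1950", 0, 1),
    ("1950", 22, 1),
    ("1950", -11, 1),
    ("1949", 111, 1),
    ("1949", 78, 1),
]

MISSING = 9999                   # sentinel filtered out by the Pig script
VALID_QUALITY = {0, 1, 4, 5, 9}  # quality codes accepted by the Pig script


def max_temp_by_year(rows):
    """FILTER, GROUP BY year, then MAX(temperature), as in max_temp.pig."""
    result = {}
    for year, temperature, quality in rows:
        if temperature == MISSING or quality not in VALID_QUALITY:
            continue
        if year not in result or temperature > result[year]:
            result[year] = temperature
    return result


max_temp = max_temp_by_year(records)
for year in sorted(max_temp):
    print((year, max_temp[year]))
```

The output pairs agree with the `DUMP max_temp` result on the slide: (1949,111) and (1950,22).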

  • Apache Hive

    Data Warehouse Infrastructure

    Data Summarization

    Ad hoc Query on Hadoop

    MapReduce for Execution

    HDFS for Storage

    MetaStore

    Table/Partition

    Thrift API

    Metadata stored in any SQL backend

    Hive Query Language

    Basic SQL : Select, From, Join, Group BY

    Equi-Join, Multi-Table Insert, Multi-Group-By

    Batch Query

    https://cwiki.apache.org/Hive/languagemanual.html

  • Hive QL

    SQL DDL Operation

    HDFS

    58

    hive> CREATE TABLE ratings (userid STRING, movieid STRING, rating INT) ROW

    FORMAT DELIMITED FIELDS TERMINATED BY '^' STORED AS TEXTFILE;

    https://cwiki.apache.org/Hive/languagemanual-ddl.html

    hive> LOAD DATA INPATH '/movielens/ratings.dat' OVERWRITE INTO TABLE

    ratings;

  • Hive QL

    59

    hive> INSERT OVERWRITE DIRECTORY '/movielens/ratings.dat' SELECT r.* FROM ratings r WHERE r.movieid='1212';

    hive> SELECT m.movieid, r.userid, r.rating FROM movies m JOIN ratings r ON (m.movieid = r.movieid);

    hive> INSERT OVERWRITE TABLE events SELECT a.bar, count(*)

    FROM invites a

    WHERE a.foo > 0 GROUP BY a.bar;

    hive> INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out'

    SELECT a.* FROM invites a

    WHERE a.ds='2008-08-15';

  • Big Data

    Hadoop

    Hadoop Project

    ) MapReduce ~ ~

    Hadoop

    60

  • Hadoop

    ( )

    !

    Hadoop, Pig, Hive !

    !

    !

    61

  • 62

    ,

  • 63

    SI

  • Big Data Market Forecast

    64

  • Big Data Revenue

    65

  • Big Data Market Share

    66

  • Big Data Revenue By Type

    67

  • Hadoop

    Software Maestro 3rd [email protected]

    September 17, 2012

    (SW Maestro) Hadoop September 17, 2012 1 / 47

  • Section 1


  • 1 , HDFS HDFS .

    2 Lucene TF-IDF (Term Frequency-Inverse Document Frequency) , MapReduce .


  • 3 .

    1 - , (HDFS) .

    2 - Hadoop Full-Text (TF-IDF).

    3 - , .


  • Section 2


  • (Crawler)

    1 HDFS .

    2 ( ) .

    3 URL .

    4 robots.txt .

    5 IT ,Hadoop .

    6 .


  • , Manager Worker .

    Manager .

    , , .

    Worker .

    Raw Data HDFS .

    Manager .

    , Manager .


  • Section 3


  • TF-IDF

    .

    TF(Term Frequency) IDF(Inverse Document Frequency) .

    TF-IDF TF-IDF . , .

    .

    , . .


  • TF-IDF Algorithm

    , .

    $t$: a term; $D$: the set of documents

    $n_{t,d}$: the number of occurrences of term $t$ in document $d$

    $|D|$: the total number of documents

  • TF-IDF Algorithm

    Term Frequency:

    $tf_{t,d} = n_{t,d}$

    Inverse Document Frequency:

    $idf_{t,d} = \frac{1}{|\{d : t \in d \in D\}| + 1}$

    TF-IDF of term $t$ in document $d$ over the document set $D$, combining TF and IDF:

    $tfidf_{t,d,D} = tf_{t,d} \cdot idf_{t,d} \quad (t \in d \in D)$

  • Enhanced TF-IDF

    TF-IDF .

    1 , . TF .

    2 1000 1 A 2 B ?


  • Enhanced TF-IDF

    TF-IDF .

    $tf_{t,d} = \begin{cases} 1 + \ln(n_{t,d}) & \text{if } n_{t,d} > 0 \\ 0 & \text{if } n_{t,d} = 0 \end{cases}$

    $idf_{t,d} = \ln\left(\frac{|D|}{|\{d : t \in d \in D\}| + 1}\right)$
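The enhanced TF-IDF weighting defined on this slide (log-scaled term frequency and log-scaled, +1-smoothed inverse document frequency) can be written as a minimal Python sketch. The function and parameter names are hypothetical; `doc_freq` stands for $|\{d : t \in d \in D\}|$ and `num_docs` for $|D|$.

```python
import math


def tf(n_td: int) -> float:
    """Log-scaled term frequency: 1 + ln(n) when the term occurs, else 0."""
    return 1.0 + math.log(n_td) if n_td > 0 else 0.0


def idf(num_docs: int, doc_freq: int) -> float:
    """Inverse document frequency with +1 in the denominator, as on the slide."""
    return math.log(num_docs / (doc_freq + 1))


def tf_idf(n_td: int, num_docs: int, doc_freq: int) -> float:
    """Weight of a term in one document: tf * idf."""
    return tf(n_td) * idf(num_docs, doc_freq)


# Hypothetical numbers: a term occurring 3 times in a document,
# found in 4 of 10 documents overall.
print(tf_idf(n_td=3, num_docs=10, doc_freq=4))
```

The logarithm in `tf` damps the effect of a term repeated many times in one document, and the `+ 1` in `idf` keeps the ratio defined even when a term appears in no indexed document.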

  • Example

    $t$ = health: $idf_{t,d} = \ln\left(\frac{4}{2}\right) = 0.6931$

    Document | Text | $n_{i,d}$ | $n_{t,d}$ | $tf_{t,d}$ | $tfidf$

    d1 | Health is a necessary condition for happiness. | 7 | 1 | 0.134 | 0.093

    d2 | It is the business of the police to protect the community. | 11 | 0 | 0 | 0

  • Example

    Document | Text | $n_{i,d}$ | $n_{t,d}$ | $tf_{t,d}$ | $tfidf$

    d3 | The city health business department runs several free clinics for health professionals throughout the year. | 15 | 2 | 0.13 | 0.087

    d4 | That plane crash was a terrible business. | 7 | 0 | 0 | 0

    , health TF-IDF (d1, d3) .


  • Section 4


  • Vector Space Model

    .

    Vector , Vector (Dimension) .

    d VSM .

    $V_d = [w_{1,d}, w_{2,d}, \ldots, w_{N,d}]^T$

    where $w_{t,d}$ is:

    $w_{t,d} = tfidf_{t,d,D} = tf_{t,d} \cdot idf_{t,d}$

  • Cosine Similarity

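    [Figure: the query vector $\vec{q}$ and the document vectors $\vec{d_1}$ and $\vec{d_2}$ drawn in the vector space]

    The similarity between the query vector $\vec{q}$ and a document vector is measured by $\cos\theta$, the cosine of the angle between them:

    $\cos\theta = \frac{\vec{d_1} \cdot \vec{q}}{|\vec{d_1}||\vec{q}|}$

    This measure is called Cosine Similarity.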
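A minimal sketch of the cosine similarity between a query vector and a document vector, using plain Python lists as the TF-IDF weight vectors. The function name is hypothetical, and returning 0.0 for a zero-length vector is a design assumption rather than something the slides specify.

```python
import math


def cosine_similarity(doc_vector, query_vector):
    """cos(theta) = (d . q) / (|d| |q|); returns 0.0 if either vector is all zeros."""
    dot = sum(d * q for d, q in zip(doc_vector, query_vector))
    norm_d = math.sqrt(sum(d * d for d in doc_vector))
    norm_q = math.sqrt(sum(q * q for q in query_vector))
    if norm_d == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_d * norm_q)


# Hypothetical three-term weight vectors.
d1 = [0.7, 0.0, 0.2]
d2 = [0.0, 0.9, 0.0]
q = [0.5, 0.0, 0.1]
print(cosine_similarity(d1, q), cosine_similarity(d2, q))
```

Because the measure depends only on the angle, a long document and a short one with the same term proportions score the same against a query.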

  • , .

    1 .

    2 .

    3 Cosine Similarity .

    4 Similarity .


  • Section 5


  • Subsection 1

    TF-IDF()


  • Flow Diagram

    MapReduce Flow , Flow Diagram .

    - HDFS .

    - HDFS TextFile .

    - .


  • TF-IDF Data Flow Diagram

    [Figure: TF-IDF data flow. Flow A: Term-Document Index; Flow B: Document-Term Index; Flow C: Calculate TF; Flow D: Calculate DF. The results (TD and DT indexes, TF and DF values) are stored in MySQL.]

  • Flow A. Term-Document Index

    Document

    Document

    Noun Extracter

    Noun Extracter

    Term Document Indexer

    MySQL(TD Index)

    ID: 13, "

    ."

    ID: 14, " OS X

    ."

    ["","","","",""]

    ["","OS","X","","",""]

    Mapper Reducer

    MapReduce Job

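The term-document index built in Flow A, and the document frequency later derived from it, can be sketched in a few lines of Python. This is a single-process illustration under stated assumptions, not the MapReduce job itself: the function names are hypothetical, the noun extraction step is assumed to have already produced term lists, and the English terms stand in for the Korean nouns lost from the slide.

```python
from collections import defaultdict


def build_term_document_index(documents):
    """Map each term to the sorted list of document IDs that contain it."""
    postings = defaultdict(set)
    for doc_id, terms in documents.items():
        for term in terms:          # mapper side: emit (term, doc_id)
            postings[term].add(doc_id)
    # reducer side: collect the document IDs for each term
    return {term: sorted(doc_ids) for term, doc_ids in postings.items()}


def document_frequency(td_index):
    """Number of documents containing each term, read off the TD index."""
    return {term: len(doc_ids) for term, doc_ids in td_index.items()}


# Hypothetical extracted nouns for the two document IDs shown on the slide.
documents = {
    13: ["apple", "phone", "release"],
    14: ["apple", "OS", "X", "release"],
}
td_index = build_term_document_index(documents)
df = document_frequency(td_index)
print(td_index)
print(df)
```

Keeping the index keyed by term is what lets the search step load only the documents that share at least one term with the query.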

  • Flow B. Document-Term Index

    Document

    Document

    Noun Extracter

    Noun Extracter

    Document Term Indexer

    MySQL(DT Index)

    ID: 13, "

    ."

    ID: 14, " OS X

    ."

    ["","","","",""]

    ["","OS","X","","",""]

    Mapper

    MapReduce Job


  • Flow C. Term Frequency

    Document

    Document

    Noun Extracter

    Noun Extracter

    Term Frequency Counter

    MySQL(TF)

    ID: 15, "

    ."

    ID: 27, "OmniGraffle 99 ."

    ["", "", "", "", ""]

    ["OmniGraffle", "", "", "99", "", ""]

    Mapper Combiner

    MapReduce Job

    WordCount .


  • Flow D. Document Frequency

    MySQL(TD Index)

    Document Frequency Counter

    MySQL(DF)

    SQL Query

    IDF DF

    DocumentCount .


  • Subsection 2


  • Data Flow Diagram

    [Figure: search data flow. Query (User Input) → Flow A: Vectorize → Flow B: List Preload (from MySQL) → Flow C: Scoring (MySQL, temporary) → Flow D: Sorting and Paging → Search Result]

  • Flow A. Vectorize

    Query(User Input)

    Noun Extractor → Term Frequency Counter → Next Flow

    " " ["", "", ""] , ,

    VSM

    Term Frequency .


  • Flow B. List Preload

    Query Vector

    Merge document list contain terms in query vector

    MySQL

    Load Document Vector Information

    .

    , TF 300 .
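The preload step can be pictured as a union of the posting lists for the query terms, trimmed to a fixed number of candidates. The sketch below assumes the slide's "TF 300" means keeping the 300 documents with the highest term frequency; that reading, the `preload_candidates` helper, and the sample postings are all assumptions.

```python
import heapq


def preload_candidates(query_terms, postings, limit=300):
    """Merge posting lists for the query terms; keep the top `limit` documents by TF."""
    best_tf = {}
    for term in query_terms:
        for doc_id, tf in postings.get(term, []):
            best_tf[doc_id] = max(best_tf.get(doc_id, 0), tf)
    return heapq.nlargest(limit, best_tf, key=best_tf.get)


# Hypothetical postings: term -> list of (doc_id, term frequency).
postings = {"hadoop": [(1, 5), (2, 1)], "search": [(2, 3), (3, 2)]}
candidates = preload_candidates(["hadoop", "search"], postings, limit=2)
print(candidates)  # [1, 2]
```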

  • Flow C. Scoring

    Query Vector

    Load Document Frequency

    MySQL

    Loaded Document Vector

    Scoring TF-IDF

    , ,

    , ,

    , ,

    .

    Cosine-Similarity .
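Scoring with TF-IDF and cosine similarity can be sketched as follows: weight each term frequency by its IDF, then compare the query and document vectors by the cosine of the angle between them. The helper names and the IDF values are assumptions chosen for illustration.

```python
import math


def tfidf(tf, idf):
    """Weight a {term: tf} vector by the given {term: idf} table."""
    return {term: freq * idf.get(term, 0.0) for term, freq in tf.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical IDF table and term-frequency vectors.
idf = {"hadoop": 1.0, "search": 2.0}
query = tfidf({"hadoop": 1}, idf)
doc_a = tfidf({"hadoop": 2}, idf)
doc_b = tfidf({"search": 1}, idf)
print(cosine(query, doc_a), cosine(query, doc_b))
```

Because cosine similarity normalizes by vector length, a long document is not favored merely for repeating a term more often.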

  • Flow D. Sorting and Paging

    Presorted TF-IDF Scores

    , ,

    .

    Sorting Sorted Data

    ,,

    , .

    .
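Sorting and paging can be illustrated with a small helper that orders documents by descending score and slices out one page. The function name, the tie-break on document ID, and the page size are assumptions, not details taken from the slides.

```python
def page_results(scores, page, page_size=10):
    """Sort by descending score (ties by doc ID) and return one 1-based page."""
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    start = (page - 1) * page_size
    return ranked[start:start + page_size]


# Hypothetical TF-IDF scores keyed by document ID.
scores = {"a": 0.1, "b": 0.9, "c": 0.5}
print(page_results(scores, page=1, page_size=2))  # [('b', 0.9), ('c', 0.5)]
```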

  • Section 6

  • SKT T cloud biz 4

    1 : 1 Vcore, 2GB RAM, 40GB HDD, CentOS 5.5 64bit

    Sun Java 1.6.0_35

    Apache Hadoop 1.0.3 IP

    Hadoop1: 1.234.45.90 (Namenode, Secondary Namenode)

    Hadoop2: 1.234.45.94 (Datanode)

    Hadoop3: 1.234.62.102 (Datanode)

    Hadoop4: 1.234.62.101 (Datanode)

  • Hadoop1 (1.234.45.90) ssh ., HDFS/chiwanpark/memento-input .

    > hadoop jar memento-engine-0.1-SNAPSHOT.jar com.chiwanpark.memento.mapreduce.WorkRunner

    .

  • Hadoop1 ssh . > java -classpath memento-engine-0.1-SNAPSHOT.jar:/opt/hadoop/conf com.chiwanpark.memento.searcher.cli.SearchRunner query ""

    id TF-IDF Score . ID HDFS .

    > hadoop fs -cat /chiwanpark/memento-input/e02f5b1df830e8fcf89df333dc2dd642a9f0569ee6aea26cc1e3ec3a22e4b988bfadb397c1ba7bd593feb5bd99276b9ce15a84741b5fe583d1dc2cb9110ae70c.txt

  • Section 7

  • Subsection 1

  • MapReduce , .

    TF-IDF Lucene Lucene Score TF-IDF Score .

  • Test1 Job1 - 102 /3 58 ( ) Job2 - 102 /3 43 ( ) 0.22

    Test2 Job1 - 99 /3 54 ( ) Job2 - 99 /4 4 ( ) 0.21

  • Test3 Job1 230 /8 44 ( ) Job2 230 /8 16 ( ) 0.22

    Test4 Job1 1862 /1 3 55 ( ) Job2 1862 /1 4 27 ( ) 0.24

  • Subsection 2

  • ,

    , .

    .

  • , Hadoop File Split Mapper . , Single line Split .

    Cloud System 4 , VM I/O . VM .

  • 0

    / 1

    TTA

  • 1

    ,

    .

    .

    ([email protected])

  • 2

    . .

    "We also want to challenge industry, research universities, and nonprofits to join with the administration to make the most of the opportunities created by BIG DATA. We need what the president calls an 'all hands on deck' effort." Tom Kalil (OSTP)

  • 3

    ( ?)

    ,

    2012

    : ??

  • 4

    IBM

    2012 CEO

    IBM ,

    PC

  • 5

    IBM CEO

    60 100 CEO

    "One of the most profound things they talk about is data will separate the winners and losers in every single industry."

    CEO

    ??

  • 6

    BIG DATA ( )

    /

    / New Revolution

  • 7

    ?

    BIG : (volume) -

    Gartner 3V = Volume + Variety + Velocity

  • 8

    HDD (1980~2010)

  • 9

  • 10

    IT

    ,

    Hadoop :

    Amazon Web Service

  • 11

    ,

    ,

    ,

    Definition (Broad sense):

  • 12

    3V

    , ,

    , , ,

  • 13

    /

    ,

    (context-based service)

  • 14

    PC

    ??

    ?

  • 15

    ()

    () . Tim O'Reilly

  • 16

  • 17

    10

    10

  • 18

    Occupy BIG DATA!

  • 19

    , -

    - 1/3 10TB

    BIG DATA

    BIG DATA TECH

    ,

  • 20

    -

    , ()

    -

    ,

  • 21

    [] (sensing)

  • 22

    The Santa Cruz Experiment

    :

    2011 7 1 27%

  • 23

    /

    , ,

  • 24

    -

    ? 10

    LTE 1

  • 25

    ( CEO )

  • 26

    ()

    ,

    100

  • 27

    BIG Data = Big Brother?

    Privacy

    /

    vs.

    , ,

  • 28

    , ?

    ,

  • 29

    1

    ,

    ,

    ICT

  • 30

    , ,

  • 31

    Tim Berners-Lee Nigel Shadbolt

    2011

  • 32

    ~2010 2011 2012 2013 2014 2015 2016~

    (IoT)

    / , SNS

    DATA

    MPP DWH - PB

    MPP DWH

    Stock

    + Flow

    (POS/ ) ,

    (SNS ) ,

    Stock/Flow

    : (2011).

  • 33

    2013 (10/50%)

    2013 (4/20%)

    *

    WHY?

    and

    1

  • 34

    ,

    1

    , Go or Stop? []

    ICT

    /

    Slope of Enlightenment

    2012

    2013

    2015~6

    2016~7

    2018

  • 35

    --

    8

    10

  • Big Data

    October 18, 2012

  • 2012 SAP AG. All rights reserved. 2

    Agenda

    1. Big Data

    2. Big Data Technology Outlook

    3. Big Data

    4. SAP Big Data SAP Big Data Framework

    5.

  • Big Data

  • Big Data Gartner, IDC

    , .

    (Critical Mass)

    Big Data

    Mobile Device (Smart Device)

    Cloud Service

    Social Media

    Big Data 3

    Cloud Computing

    Real Time

    Network

    Big Data

    E-mail: 290

    : 375 MB

    Youtube :

    20

    Google :

    240 MB

    twitter : 5,000

    Facebook :

    7,000

    Mobile Internet :

    1.3 MB

    Amazon :

    72.9 GOOD & Munday, 2011 the world of Data

  • Big Data 2012 9

    Aberdeen presents a baseline of current "Big Data" initiatives and highlights some of the most attention-grabbing strategies and solutions.

    Surprisingly, 93% of companies surveyed listed structured data as key to their "Big Data" efforts, followed by the more typical sources such as social media and customer sentiment data.

    Predictive analytics features prominently in "Big Data's" future, but about three out of five companies polled also cited mobile BI and in-memory computing as technologies they will be investing in within the next two years.

  • Big Data 2012 9

    Source: Aberdeen Group, January 2012

    1: Drivers for Fast, Streamlined Analysis of More Data

    47% 1

    35% Real Time Near Real Time

    71% , 3 1

    : 150 TB

    17% 1 PB

    42% , 1/5 75%

    23%

    47%

    : 14, 9, 5 Big Data Enterprise

    Big Data , Active Business Data 5 TB 99

    , ;

    Dark Data

    Velocity

  • Big Data 2012 9

    Big Data .

    Big Data , , 93% Big Data ( )

    : High Volume, High Velocity, Internet generated source Click Stream, Social Media, customer sentiment data

    , ,

    ,

    , ,

    Human Resource , Location & Geo-spatial

    Digital Media

    Machine to Machine (M2M), Sensor

    ,

    : (Doc, PPT, XLS), e-Mail

    2: Sources that feed Big Data

    Source: Aberdeen Group, January 2012

  • Big Data 2012 9

    Currently Use

    Plan to Use

    Predictive Analytics Big Data , Big Data

    3: The Technological Wave of the Future Big Data

    Source: Aberdeen Group, January 2012

    Big Data High Volume

    MPP: cluster computing

    Columnar DB:

    Real time Integration Tools: / Stream

    BI Mobile BI

    In-Memory Computing

    , Commodity

  • Big Data 2012 9

    1: Unique Data Source Used for Business Analysis

    Source: Aberdeen Group, January 2012

    2: The Top Processes Driving Data Management Initiative

    Source: Aberdeen Group, January 2012

    ,

    12 : 38%

    3 2.5

    (EDW, DM, Application, Unstructured, Social Data)

    ,

    , , ,

    Volume Velocity

    Dark Data

    Variety / Complexity

  • Big Data 2012 9

    3: Top Strategic Actions to Support Data Management

    Source: Aberdeen Group, January 2012

    4: Who Owns Data Management / Government

    Source: Aberdeen Group, January 2012

    Big Data IT

    IT , .

    Big Data

  • Big Data Technology Outlook

  • Big Data Eco-System

    NoSQL

    Data .

    /

    Hadoop

    Apache Open source project

    Map/Reduce: , Web logs, text data, graph data.

    Hbase:

    Hive: , , DW

    Commercial support Cloudera, HortonWorks, IBM, & EMC/Greenplum.

    R Language

    Open Source

  • Big Data Hype Cycle, 2012

    Figure 1. Hype Cycle for Big Data, 2012

  • Big Data Priority Matrix, 2012

    Less than 2 years 2 to 5 years 5 to 10 years More than 10 years

    Transformational Column Store DBMS Cloud Computing In-Memory Database

    Management Systems

    Complex-Event Processing Content Analytics Context-Enriched Services Hybrid Cloud Computing Information Capabilities

    Framework Telematics

    Information Valuation Internet of Things

    High Predictive Analytics Advanced Fraud Detection and Analysis Technologies

    Cloud-Based Grid Computing Data Scientist In-Memory Analytics In-Memory Data Grids Open Government Data Predictive Modeling Solutions Social Analytics Social Content Text Analytics

    Cloud Parallel Processing High-Performance Message

    Infrastructure IT Service Root Cause

    Analysis Tools Logical Data Warehouse Sales Analytics Search-Based Data Discovery

    Tools Social Network Analysis

    Semantic Web

    Moderate Social Media Monitors Web Analytics

    Activity Streams Claims Analytics Database Platform as a

    Service (dbPaaS) Database Software as a

    Service (dbSaaS) Intelligent Electronic Devices MapReduce and Alternatives noSQL Database Management

    Systems Speech Recognition Web Experience Analytics

    Cloud Collaboration Services Dynamic Data Masking Geographic Information

    Systems for Mapping, Visualization and Analytics

    Open SCADA Video Search

    Low

    Years to mainstream adoption

  • Big Data

  • 11 Industry Big Data Opportunity Heat Map

    Big Data .

    Volume, Velocity, Variety

    Hardware, Software, Service

  • Big Data AS-IS

    ERP/CRM/SCM/PLM/MES

    +

    / :

    : High

    ACID :

    Data Governance : High

    DW/eDW/DM/RMS/BI

    +

    / :

    : Middle

    ACID :

    Data Governance : High

    ECM/EDMS/KMS/ILM

    +

    / :

    : High

    ACID :

    Data Governance : Middle

    Blog/Facebook/Twitter/Log

    / :

    : Low

    ACID :

    Data Governance : Low

    , , , , ACID , Data Governance, ,

    Business Social Media

  • Big Data AS-IS : AS-IS

    () ()

    / Dot Com

    : 162

    Dark Data Big Data .

    ** Dark Data , , ,

    Source: Gartner, July 2012 [Dark Data Represents the Most Immediate Opportunity to Leverage Big Data]

  • Big Data AS-IS : Big Data Market Big Data Big Data

    Business Big Data ( ) Market Big Data (Portal )

    +

    +

    , , ACID (Atomicity/Consistency/Isolation/Durability ) - , , , - , ,

    , ACID CAP (Consistency / Availability / Partition Tolerance 2 )

    Real Time Time Latency

    Fact Past , Future

    BI Tool Tool Open Source

    Data Scientists, Experts

    RDBMS SQL Open Source

    Open Source Platform NoSQL Map/Reduce + Hadoop

    * Open Source

    * / ,

  • Big Data

    /

    Cloud Digital Prototyping &Testing On demand Cloud

    branch /Self

    (Trading, , Processing)

    Trading //

    ICT Content

    Content /Social

    Content

    Tracking

    /

    /

    Processing

    Booz&Company (2011) the next wave of digitization setting your direction, Building your capabilities

  • Big Data Best Practices -

    Big Data Best Practice

    , , IT , ,

    .

    o Hadoop Big Data

    o Hadoop DW

    o MapReduce Hadoop

    Big Data

    , off line

    [Gartner 12 dimension model for Big Data]

  • Big Data : Open Source Big Data

    Data

    o Commodity System VS Enterprise System

    Hadoop (HDFS) Batch Processing

    o ,

    Big Data BI tool

    Skill Set

    o Hadoop, Data Scientist, NoSQL, Map/Reduce, R Language

    Big Data Back Up

    Big Data Data Governance / Compliance

    Big Data ( , )

    HDFS

    Name Node (stores metadata)

    Data Node (stores actual data in blocks)

    Data Node (stores actual data in blocks)

    replication

    client

    HDFS → MapReduce → HDFS (input → process → output)
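The division of labor between the Name Node (metadata) and the Data Nodes (blocks plus replicas) can be pictured with a toy model. This is not how HDFS actually places replicas (real HDFS uses a rack-aware policy); the round-robin placement, the function names, and the sizes below are assumptions chosen only to show the idea.

```python
def split_into_blocks(file_size, block_size):
    """Return (offset, length) pairs describing how a file is cut into blocks."""
    return [
        (offset, min(block_size, file_size - offset))
        for offset in range(0, file_size, block_size)
    ]


def place_replicas(num_blocks, datanodes, replication=3):
    """Toy metadata table: block index -> data nodes holding a replica (round-robin)."""
    copies = min(replication, len(datanodes))
    return {
        block: [datanodes[(block + i) % len(datanodes)] for i in range(copies)]
        for block in range(num_blocks)
    }


blocks = split_into_blocks(file_size=150, block_size=64)
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3"], replication=2)
print(blocks)
print(placement)
```

In this model the `placement` dictionary plays the role of the Name Node's metadata, while the block contents themselves would live only on the listed Data Nodes.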

  • SAP Big Data SAP Big Data Framework

  • Big Data 3V (Velocity, Volume, Variety)

    CRM data, GPS, Demand, Speed, Velocity, Transactions, Opportunities, Service Calls, Customer, Sales orders, Inventory, E-mails, Tweets, Planning, Things, Mobile, Instant messages

    Velocity 18 2 ,

    IDC

    Volume 2005 150 Exabyte, 2011 1,200 Exabyte

    The Economist

    Variety 80 % ( + )

    Gartner

  • SAP Big Data Framework (Velocity, Volume, Variety)

    Velocity 18 2 ,

    IDC

    SAP Sybase ESP Complex Event Processing Engine

    Real Time Analytic

    Query then Data, not Data then Query

    SAP HANA In Memory Computing Engine

    In Memory Appliance

    In Memory Analytic

    Up to 1,000 times faster

    SAP Sybase IQ Smarter Analytic engine

    The 1st Columnar DBMS

    Open Platform

    In Database Analytic

    :

    Now-casting

  • SAP Big Data Framework (Velocity, Volume, Variety)

    Volume 2005 150 Exabyte, 2011 1,200 Exabyte

    The Economist

    SAP Sybase IQ Smarter Analytic engine

    Multiplex Grid Architecture

    No Volume Limitation The Largest EDW Platform

    SAP HANA In Memory Computing Engine

    In Memory Appliance

    Up to 100 node scale out Capacity

    ->

  • SAP Big Data Framework (Velocity, Volume, Variety)

    Variety 80 % ( + )

    Gartner

    SAP Sybase IQ Smarter Analytic engine

    Unstructured Data Management

    Hadoop Integration

    SAP HANA In Memory Computing Engine

    Text Analytic Engine

    R embedded

  • Ingest Store Process Present

    Effort

    Effort

    /

    Extract-Transform-Load

    Event Stream Processing

    ACID

    SQL/OLAP

    DB UDF

    DB DFS

    Low-latency

    ,

    (DFS)

    BASE

    BI

    Map/Reduce ,

    SQL

    Connectivity SQL

    High-latency

    Big Data

  • SAP Real-time Analytics

    SAP Big Data Processing Framework

    Hadoop

    Smart Meter

    ,

    Big Data ad-hoc

    Big Data streaming

    Big Data

    , ,

    &

  • SAP BusinessObjects BI solutions

    Transaction Processing

    DB Engine

    In-memory Computing Engine

    DB Engine

    Analytic Grid

    DB Engine

    MapReduce Batch Compute Framework

    Sybase Replication Server, SAP BusinessObjects Data Services (Integrate / synchronize data across deployment options)

    Sybase ESP Stream & event

    processing

    SAP Big Data Processing Framework

    SAP HANA Sybase IQ

    Sybase ESP Monitor / filter

    streaming events

    Semi-structured Data Structured Data Unstructured Data

    Hadoop Sybase ASE

    Hive/HDFS

    SAP Big Data Framework :

    ,

    1) , 2) , 3)

    Ingest, Store, Process, Present

    ( )

    Targeting

  • Hadoop Distributions | OS + Hardware | Map-Reduce (M/R) Support

    Reporting / Analytics

    Reporting / Analytics

    Reporting / Analytics

    EDW ETL / Push Down Transformations

    ETL / Move

    Scheduled reports

    Data Mart Data Warehouse

    Big Data EDW Streaming Real-Time Analytics

    M/R Analytics

    M/R Analytics

    M/R Analytics

    HADOOP HADOOP HADOOP

    CEP

    Hadoop Big Data

  • : Mitsui Knowledge Industry Healthcare industry Cancer cell genomic analysis

    : Real-time Big data (R + Hadoop + HANA)

    Mitsui IT

    , , Big Data , : 1,990

    :

    1 1 TB DNA Sequence Matching

    :

    2 3 . HANA MKI 15 , 216

    : DNA

    :

    : ,

    Generate Reports

    Generate Reports

    Generate Reports

    HANA

    Hadoop

    Hadoop-HANA Connector

    Variant Calling with samtools

    More Analysis with R packages

    R Integration Predictive Analysis

    Library

    Preprocess Data Analysis Annotation

    : 2~3 -

    : 2~3 ( )

    : 20~40 - SAP HANA & Apache Hadoop

    Manual tasks Computational tasks

  • : T Mobile USA

    : SAP HANA + SAP Business Object + DW

    2011 ( 2 1 )

    ,

    ( 9 2 )

    :

    50 - 60

    18 (Teradata)

    5.5 , 60

    2 1

    . ,

    Company T-Mobile USA Headquarters Bellevue, Washington Industry Telecommunications Products and Services Mobile telephone service Employees 36,000 worldwide Revenue US$20.6 billion

    50x improvement in the performance of analytics: "We can recalibrate offers in the marketplace in one day that took a week using our existing solutions." Erez Yarkoni, T-Mobile CIO

  • SAP Big Data Value SAP HANA Real Time Big Data

    Big Data

    Big Data

    Billing

    CDR

    Real Time Replication Pre-processing

    In DB Mining Real Time BI Market Big Data Business Big Data

    Integrated Analytics on SAP HANA

  • Big Data SAP's Value

    Higher Performance

    Higher Speed

    More Data

    Better Capability

    SAP's Advanced Value

    Business

    Social Media

    Hadoop

  • Big Data Big Data

    SAP Big Data Framework Big Data Value

    Volume + Variety

    Volume + Velocity

    Hadoop batch pattern analysis

    SAP real-time analytical

    processing

    SAP Big Data

    Value

    , , SAP Big Data

    ,

    Big Data

  • !

    SAP D&T

  • -

    l HANA l Database & Technology l SAP Korea

    : SAP HANA

  • 2012 SAP Korea All rights reserved. 2

    1. In-memory Computing ?

    2. SAP In-memory Technologies

    3. -

    4. Roadmap

  • In-memory Computing ?

  • IMC (In-Memory Computing)

    Big Data :

    Mobile :

    RTE, Cloud, SaaS

    x86 64bit multi-cores

    DRAM $10 / GB NAND Flash $1 / GB

    - by Gartner : Top 10 Strategic Technology Trends, 2012 Feb

    ~100ns

    >1Mns

    +

    IT Readiness

    S/W (IMDB)

    +
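Taking the two latency figures on this slide at face value, the gap between them can be worked out directly. The labels did not survive extraction, so this assumes that ~100 ns refers to a DRAM access and >1M ns to a disk access; under that assumption the ratio is at least four orders of magnitude:

```latex
\[
\frac{t_{\text{disk}}}{t_{\text{DRAM}}} \;\ge\; \frac{10^{6}\,\text{ns}}{10^{2}\,\text{ns}} \;=\; 10^{4}
\]
```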

  • IMC

    2012, 70% Global 1000 BI , .

    - Tipping Point 2013 .

    2016 - - DBMS 25% DW (OLTP) .

    Big Data 93% DBMS 63% In-Memory Computing, 50% Columnar DB, 50% Hadoop .

    Oct 2011

    Oct 2006

    Jan 2012

    Feb 2012

    ()

    ~

  • -

    - SAP .

    .

    .

    .

    .

    1990 .

    .

  • - IT

  • SAP IMC Technologies

  • SAP In-Memory Computing Evolution

    Object Store

    APO In-memory Object Cache

    2000 Object Store

    Column Store

    In-memory Text Search Column Index

    2001

    Object Store

    Column Store

    Row Store SQL

    OLTP

    Row Store IMDB 2005 SAP

    2002

    Object Store

    Column Store

    Row Store SQL

    OLTP

    MPP Appliance

    BW In-Memory MPP Appliance

    2006

    SAP HANA In-Memory Database

    Row & Column Store OLTP OLAP

    H/W Appliance 2011

  • In-Memory DB : SAP HANA

  • : Disk-based vs Memory-based

    Data Block Memory Cache

    Database ( 10 TB)

    Conventional RDBMS

    Disk I/O

    Memory (128 GB)

    Memory

    Data Volume Log Volume

    All Data Sets

    Persistent Storage

    SAP HANA

    Data Modeling

    ( Page)

    Database

    Disk Database

    (100TB+)

  • SAP HANA

    Synergy : In-memory + Columnar + MPP

    HANA

    DW

    + 5,000

    > 1,000

    SAP HANA

    Row

    ,

    Column

    1/10
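The contrast between row and column storage can be illustrated with a few lines of code: a column layout keeps each attribute contiguous, so an aggregate touches only the column it needs, and repeated values compress well with dictionary encoding. This is a generic sketch under assumptions (sample data, column names, and helper names are made up) and does not describe SAP HANA's internal format.

```python
# Row layout: each record is stored together.
rows = [("KR", "A", 10), ("KR", "B", 20), ("US", "A", 30)]

# Column layout: each attribute is stored together.
names = ("country", "product", "amount")
columns = {name: [row[i] for row in rows] for i, name in enumerate(names)}

# An aggregate over one attribute reads a single column.
total_amount = sum(columns["amount"])


def dictionary_encode(values):
    """Replace repeated values with small integer codes plus a lookup table."""
    dictionary = sorted(set(values))
    code_of = {value: code for code, value in enumerate(dictionary)}
    return dictionary, [code_of[value] for value in values]


country_dict, country_codes = dictionary_encode(columns["country"])
print(total_amount, country_dict, country_codes)  # 60 ['KR', 'US'] [0, 0, 1]
```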

  • In-Memory MPP DB

    Disk-based MPP

    In-memory MPP

    MPP

    SMP

  • Latency

  • ()

    With HANA

    Without HANA

  • Stand-by Fail-over

    100TB = SAP 8

    Petabyte

    HANA -

  • Batch Processing

    Intraday+

    Very Large 1 PB+

    Ad-Hoc Predictive

    HADOOP

    Event Driven

    Transactional

    Processing EDW

    Operational Data Store

    Multi-Dimensional

    OLAP

    Real-Time Real-Time Intraday+ Intra-hour Intraday+

    Small < 1GB

    Small < 1GB

    Large 1 TB+

    Medium 100 GB+

    Medium 100 GB+

    Eventing Parametrized Parametrized Parametrized Ad-Hoc

    Predictive Analysis

    Data Volume

    Latency

    Event Insight

    Sybase ASE

    Sybase IQ

    HANA

    Drive Insights into Structured Data Analytics Framework

    +

    HANA -

  • DBMS vs Hadoop

  • SAP HANA

    . /

    .

    . , R

    .

  • 7

    BI

    / / SI/SM/

    SAP

    HANA

    ODBC

    JDBC

  • HANA -

  • Readiness

    3rd party

    3rd party backup tools - IBM Tivoli, HP Data Protector, Symantec Netbackup etc.

    3rd party monitoring tools - IBM Tivoli, HP Service Guard etc. (In preparation)

    (HA)

    Stand-by Node/System

    Disaster Tolerance

    HANA Instance Failover.

    Automatic and manual procedures possible

    &

    Full Data Backup

    Log Backup

    Disaster Recovery

    (Bare Metal Restore)

    Data Center Readiness

    SAP HANA

    Available today Available today Available today Available soon In preparation

    & Administration

    SAP Solution Manager End to End monitoring/ alerting/ scheduling

    Security & Auditing

  • SAP HANA

    Memory

    Persistence Storage

    Log Volume

    (SSD)

    Data Volume

    (SSD, High-speed SAS)

    [Persistency Layer] [Scale-out HA] [Disaster Tolerance, Warm stand-by]

  • HANA vs DW Appliance ?

    +

  • Exadata 3 vs SAP HANA

  • -

  • -

    Go deep

    Go broad

    In Real-time

    with High-speed

    w/o pre-fabrication

    ,,

    ,

    //

  • - :

    1 600+ , 200+

    HANA HANA

    1 10+

    1.5 30+

    1 10

    => IT .

    , , , , .

  • 2012 86

  • ,

    270

    , DB

  • Manufacturer

    Computing Engine

    Machine Owner/Operator

    Dealer (option: Delivered via CRM portal)

    Manufacturer

    Real Time

    Equipment data Engine temp Oil pressure RPM CO2 Defect codes Speed Etc.

    HANA

    HANA DB

    , ,

    > >

  • 60 times faster

    HANA DB R . SAS .

  • 408,000x faster than traditional disk-based systems in technical PoC

    216 (DNA): 2-3 -> 20

  • "Transforming information into intelligence in real time is a cornerstone for McLaren's winning formula and increasingly critical for the future of every company," Jim Hagemann Snabe, co-CEO, SAP AG

    "Using HANA we can hopefully automate decision making. People have always made decisions based on the data, but we want to get to the point where the system can make the decision," Stuart Birrell, McLaren CIO

    14,000 : 5 -> 1

    99% predict the outcome of a race

    5,000 events per second loaded onto SAP HANA (not possible before)

    10-30%

    Interactive data analysis leading to improved design thinking and game planning

    1,000x faster tumor data analyzed in seconds instead of hours

    :

    2-10 seconds for report execution

  • McLaren Group Limited Automotive Industry (Formula One) Predict and Transform the outcome of races

    Telemetry

    .

    .

    .

    99%

    14,000 : 5 -> 1

  • McLaren Case Study

  • 3

    95% reduction in data load time: 2 minutes in BW HANA vs. 35-40 min in BW Oracle

    266x faster query response time with 15x average

    / : (BW/Oracle) 15 (BW/HANA)

    /

    2.5x faster reporting with sub-optimized queries - from 28.54 sec. to 11.38 sec.

    453.7 : 1787.49 -> 3.94

    70% saving on storage space with data compressed to 30%

    1,000 : 77 -> 13

    60% improvement in data load time

    4-10 times faster DSO activation

    (2)

  • "Co-PA was the most interesting thing to look at in the first step. We saw response times reduce from about 620 seconds to about five seconds in one case." Andrew Pike, (former) CIO

    124x faster analytics - drilldown by alphacode - from 620 sec. to 5 sec.

    37x faster cost allocation drilldown by sending cost center - from 260 sec. to 7 sec.

    40x faster reporting: runtime reading line items for EBIT with commodity sales - from 260 sec. to 7 sec.

    9x faster cost allocation initial report - from 45 sec. to 5 sec.

    355x faster data analysis; from 77 minutes to 13 seconds

    8 weeks rapid, non-disruptive implementation

    2x data compression

    60x faster SKU/Month reporting; from 120 sec to 2 sec

    : , /
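The speedup factors quoted on this slide follow from dividing the earlier timing by the later one, which a few lines of arithmetic can check. The labels in the dictionary are shortened paraphrases, and the helper name is an assumption. Note that the 40x item quotes the same 260 s and 7 s as the 37x item; those timings give roughly 37x, so one of the two figures may be rounded or mislabeled on the slide.

```python
def speedup(before_seconds, after_seconds):
    """Return how many times faster the 'after' timing is."""
    return before_seconds / after_seconds


# (before, after) timings in seconds, as quoted on the slide.
timings = {
    "analytics drilldown": (620, 5),
    "cost allocation drilldown": (260, 7),
    "cost allocation initial report": (45, 5),
    "data analysis": (77 * 60, 13),
    "SKU/Month reporting": (120, 2),
}
factors = {name: speedup(before, after) for name, (before, after) in timings.items()}
for name, factor in factors.items():
    print(f"{name}: {factor:.1f}x")
```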

  • SAP HANA Roadmap

  • 4 HANA

  • SAP BPC ( )

    SAP Finance and Controlling Accelerator

    SAP Smart Meter Analytics

    SAP Sales Pipeline Analysis

    SAP Predictive Analytics

    SAP Customer Segmentation Accelerator

    SAP HANA Platform

    SAP Business Warehouse

    SAP BusinessObjects BI

    SAP CO-PA ( )

    SAP B1 ( ERP)

    Third Party Apps

    SAP ERP

    Today

    New Cloud Apps

    New Mobile Apps

    SAP Planning for Retail

    SAP Customer Value Intelligence

    SAP Predictive Segmentation

    SAP Sales & Operations Planning

    SAP Account Intelligence

    SAP Demand Signal Management

    SAP Account Intelligence

    SAP Liquidity Risk Management ( )

    SAP Customer Energy Mgmt.

    SAP Trade Promotion Mgmt

    Future

    HANA

  • Legacy ODS EDW Data Marts BI/Report Mart

    /

    /

    BI/

    Legacy ODS EDW Data Marts BI/Report Mart

    Oracle

    (=)

    SAP

    ()

    Legacy ODS EDW Data Marts BI/Report Mart

    SAP

    ()

    Sybase ASE

    /

    Teradata Exadata Exadata Exalytics

    + Sybase ASIQ

  • SAP HANA DB

    ERP , , Backflushing

    , (Mobile BI) Time Gap (Predictive Analysis)

    SAP HANA with Sensor Technology, Mobile, Big-Data, Social Data, etc , ,

  • -

    - ,- BI -

  • . Email: [email protected]

  • Case Study

    2012 10 18

    621 C 5

    Tel: 02-6246-1400 http://www.wise.co.kr

    TTA

    , [email protected]

  • 1 WISEiTech Case Study

    1. ,

    2.

    3.

    4. ? SNS ? ?

    5.

  • Case Study

    , ,

    () .

    .

    ,

    .

    !

  • Case Study

    .

    , ?

    1 ? -

    3 .

    .

  • 4 WISEiTech Case Study

    > >

    ()

    3 RDBMS

    ,

    Case Study

    ()

    ?

  • 5 WISEiTech Case Study

    ,

    BI (OLAP Report )

    ?

  • Case Study - v.s

    .

    .

    , ,

    .

    ?

  • 7 WISEiTech Case Study

    ?

    3V?

    ( ) ?

    100 TB ?

    ,

  • 8 WISEiTech Case Study

    1. ,

    2.

    3.

    4. ? SNS ? ?

    5.

  • Case Study - Global

    TV . TV app

    , Video .

    .

    . .

    2~3 ,

    50 . .

    ? ?

    ?

  • Case Study - Global

    Global Public Cloud 2 Global Public Cloud 1

    ODS

    ,

    DW Mart

    Mart

    OLAP

    Reporting

    ODS: Operational Data Store

    DW: Data Warehouse

    OLAP: On-Line Analytical Processing

    RDB BI

  • Case Study - Global

    .

    .

    . .

    SW .

    ?

    . . ,