INTRODUCTION TO BIG DATA FOR (UNIVERSITY) SYSTEM ADMINISTRATOR

Asst. Prof. Natawut Nupairoj, Ph.D.
Mobile Application and System Services Research Group
Head of Department
Department of Computing Engineering
Chulalongkorn University
[email protected]

“Data will be as economically important as money and gold.” - World Economic Forum

“In 2020, all the data in the world will total 40 ZB, or 5.2 TB per person.” - IDC

“Only 3% of data is ready to be used, and only 1 in 6 of that usable data, or 0.5% of all data, can be analyzed.” - IDC

B | KB | MB | GB | TB | PB | EB | ZB
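
The IDC figures quoted on this slide can be cross-checked with simple arithmetic. The sketch below assumes decimal (SI) units, where each step on the scale is a factor of 1,000; the variable names are illustrative and not from the slides.

```python
# Decimal (SI) unit ladder: each step is a factor of 1,000 (an assumption;
# binary units would use 1,024 instead).
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB"]
BYTES_PER = {unit: 1000 ** i for i, unit in enumerate(UNITS)}

total_bytes = 40 * BYTES_PER["ZB"]        # IDC: 40 ZB worldwide in 2020
per_person_bytes = 5.2 * BYTES_PER["TB"]  # IDC: 5.2 TB per person

# Number of people implied by dividing the first figure by the second.
implied_people = total_bytes / per_person_bytes

# Share of all data that can be analyzed: 1 in 6 of the usable 3%.
analyzable_share = 0.03 / 6

print(f"Implied population: {implied_people / 1e9:.2f} billion")
print(f"Analyzable share:   {analyzable_share:.1%}")
```

The two IDC numbers are consistent with a world population of roughly 7.7 billion, and 3% divided by 6 gives the quoted 0.5%.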

CHARACTERISTICS OF BIG DATA

Source: IBM

A SMALL EXAMPLE OF UNIVERSITY BIG DATA

Storage size for 30 days = 13,000,000 events (2.1 TB)

MOBILE & DEVICES - COMPUTING EVERYWHERE

Thailand’s rate is 147% (smartphone = 49%)

Wearable device shipments will double in 4 years (from 72 million in 2015 to 155 million in 2019)

20% will be healthcare-related devices

Source: http://www.wareable.com/wearable-watchlist/50-best-wearable-tech

Whistle

INTRODUCING FDA-APPROVED INGESTIBLE SENSORS IN PILLS

http://www.forbes.com/sites/singularity/2012/08/09/no-more-skipping-your-medicine-fda-approves-first-digital-pill/

Behavioral trend tracking: customize fitness program setup
Food intake tracking: visually recognize food intake
Environment factor tracking: modify fitness program recommendations

Under Armour | Connected Life

BIG DATA USE CASES

Bigger / Faster / More Up-to-Date Data Warehouse

Product Recommendation

Social Listening

Fraud Detection and Risk Management

Micro Customer Segmentation

Demand Sensing for Supply Chain

Precision Medicine

USE CASES IN A UNIVERSITY

Low-cost storage for large volumes of data

Log storage and analysis

Smart IDS

User experience analysis for the web site / mobile site

Crowdsourcing reports of Wi-Fi problems

Analysis of how students use the LMS and online media

Precision Education

Source: collegestats.org

Source: collegestats.org

(PRIMITIVE) BIG DATA ARCHITECTURE

[Diagram] Components shown:

Data Source: Volume, Velocity, Variety

Data Ingestion: Gather, Filter, Deliver

Data Storage: NoSQL

Data Processing: MapReduce

Data Visualization

Open-source software framework based on ideas from the Google search engine architecture

Emphasizes the use of commodity hardware

MapReduce makes it easy to write programs that run on a cluster, without requiring expertise in parallel processing

Provides the Hadoop Distributed File System (HDFS) for reliable, inexpensive data storage

Users: Yahoo!, Facebook, Amazon, eBay, American Airlines, Apple, Google, HP, IBM, Microsoft, Netflix, New York Times, etc.

A REAL-WORLD EXAMPLE

500,000 baht

World-class brand server: Intel Xeon (up to 18 cores), RAM 512 GB

Intel NUC: Intel Core i5 (4 cores), RAM 16 GB

24,500 baht x 20 machines: 80 cores, RAM 320 GB
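
The cluster totals on this slide follow directly from the per-machine figures. A minimal sketch of the arithmetic, with a dictionary layout chosen only for illustration:

```python
# Per-machine figures for the Intel NUC, taken from the slide.
nuc = {"price_baht": 24_500, "cores": 4, "ram_gb": 16}
machines = 20

# Totals for a cluster of identical machines: multiply each figure by the count.
cluster = {key: value * machines for key, value in nuc.items()}

print(cluster)
```

Twenty machines cost 490,000 baht in total, close to the 500,000 baht shown on the slide, and together provide 80 cores and 320 GB of RAM.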

HARDWARE VS. SOFTWARE

Hardware: reliable / Software: easy

Hardware: vulnerable / Software: ????

THE KEY ISSUE IN BIG DATA IS I/O

                       A            B            C
Config                 Single       RAID-10      Parallel
Number of HDs          1            8            16
Speed                  100 MB/sec   800 MB/sec   1600 MB/sec
Time to read 200 GB    30 min       4 min        2 min
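
The read times in the table can be reproduced from the throughput figures. The sketch below assumes 1 GB = 1,000 MB and ideal linear scaling with the number of disks; `read_minutes` is a hypothetical helper name, not something from the slides.

```python
def read_minutes(size_gb: float, throughput_mb_per_s: float) -> float:
    """Minutes needed to read size_gb sequentially, assuming 1 GB = 1,000 MB."""
    return size_gb * 1000 / throughput_mb_per_s / 60

PER_DISK_MB_S = 100  # single-disk throughput from column A of the table

for label, disks in [("A: Single", 1), ("B: RAID-10", 8), ("C: Parallel", 16)]:
    throughput = disks * PER_DISK_MB_S  # idealized linear scaling (assumption)
    minutes = read_minutes(200, throughput)
    print(f"{label:12s} {throughput:5d} MB/sec  {minutes:5.1f} min")
```

The computed values (about 33, 4.2, and 2.1 minutes) match the rounded figures in the table.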

HOW MAPREDUCE WORKS

1. Data is distributed across multiple machines
2. MAP - each machine processes its own data, all at the same time
3. REDUCE - the results are summarized back on the master machine

EXAMPLE - WORD COUNT

Count the frequency of words in a book

WORD FREQ.: MAPREDUCE

[Diagram: five Map nodes feeding one Reduce node. Each Map node stores a part of the data and is asked, "With your data, please count."]
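
The word-frequency idea can be sketched in plain Python without a cluster. This is only an illustration of the map and reduce steps, not Hadoop code; the sample text and the function names `map_count` and `reduce_counts` are made up for the example.

```python
from collections import Counter
from functools import reduce

def map_count(chunk: str) -> Counter:
    """MAP: count the words in the part of the data held by one machine."""
    return Counter(chunk.lower().split())

def reduce_counts(left: Counter, right: Counter) -> Counter:
    """REDUCE: merge two partial counts into one."""
    return left + right

# The data is already split into parts, one per (hypothetical) machine.
chunks = [
    "big data is big",
    "data is everywhere",
    "big clusters process data",
]

# On a real cluster the map calls would run in parallel; here they run in turn.
partial_counts = [map_count(chunk) for chunk in chunks]
totals = reduce(reduce_counts, partial_counts, Counter())

print(totals.most_common(3))
```

Because each map call only looks at its own chunk, the work can be spread over as many machines as there are chunks, and only the small partial counts need to travel to the reducer.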

DISKS

Average lifetime: 1,200,000 hours

With 10,000 disks, one disk fails every 5 days

Source: Google
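
The 5-day figure follows from dividing the average disk lifetime by the number of disks. The sketch assumes the average lifetime is 1,200,000 hours, which is the value consistent with one failure every 5 days across 10,000 disks, and that disks fail independently.

```python
MTBF_HOURS = 1_200_000   # average lifetime per disk (assumed value, see above)
DISKS = 10_000           # fleet size from the slide

# With many independent disks, the expected time between failures anywhere
# in the fleet is the per-disk average lifetime divided by the fleet size.
hours_between_failures = MTBF_HOURS / DISKS
days_between_failures = hours_between_failures / 24

print(f"One failure about every {days_between_failures:.0f} days")
```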

HADOOP HDFS

Rack-aware

3 copies
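
A toy sketch of what rack-aware placement of 3 copies can look like. It is loosely modeled on the default policy described in the HDFS documentation (first copy on the writer's node, the other two on different nodes of one other rack) and ignores everything a real NameNode also weighs, such as load and free space. The function `place_replicas` and the cluster layout are hypothetical.

```python
import random

def place_replicas(writer, topology, rng=random):
    """Pick 3 nodes: the writer's node, then two different nodes on one other rack.

    topology maps a rack name to the list of node names in that rack.
    """
    local_rack = next(rack for rack, nodes in topology.items() if writer in nodes)
    # A remote rack needs at least two nodes to hold copies 2 and 3.
    remote_racks = [rack for rack, nodes in topology.items()
                    if rack != local_rack and len(nodes) >= 2]
    remote_rack = rng.choice(remote_racks)
    second, third = rng.sample(topology[remote_rack], 2)
    return [writer, second, third]

# Hypothetical two-rack cluster, used only for illustration.
topology = {
    "rack1": ["node1", "node2", "node3"],
    "rack2": ["node4", "node5", "node6"],
}
replicas = place_replicas("node1", topology)
print(replicas)
```

Spreading the copies over two racks means the data survives the loss of a whole rack, while keeping two copies on the same rack limits cross-rack traffic during writes.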

HOW HADOOP WORKS

HADOOP ARCHITECTURE

An in-memory data processing system from UC Berkeley

Extends MapReduce to support batch execution, interactive queries, and stream processing

Supports multiple languages, including Java, Python, Scala, and R, and provides analytic libraries (machine learning, graph processing)

Developed and supported by contributors worldwide

10-100 times faster than Hadoop

SPARK PERFORMANCE

NOSQL – NOT ONLY SQL

An alternative for storing large data with complex structure: a distributed, non-relational system that supports scale-out

• Column: Accumulo, Cassandra, HBase

• Document: Apache CouchDB, Couchbase, MongoDB

• Search Engine: ElasticSearch, Solr

• Key-value: CouchDB, Dynamo, MemcacheDB, Redis

• Graph: Allegro, Neo4J, InfiniteGraph, OrientDB

SELECT array_agg(players), player_teams
FROM (
  SELECT DISTINCT t1.t1player AS players, t1.player_teams
  FROM (
    SELECT
      p.playerid AS t1id,
      concat(p.playerid, ':', p.playername, ' ') AS t1player,
      array_agg(pl.teamid ORDER BY pl.teamid) AS player_teams
    FROM player p
    LEFT JOIN plays pl ON p.playerid = pl.playerid
    GROUP BY p.playerid, p.playername
  ) t1
  INNER JOIN (
    SELECT
      p.playerid AS t2id,
      array_agg(pl.teamid ORDER BY pl.teamid) AS player_teams
    FROM player p
    LEFT JOIN plays pl ON p.playerid = pl.playerid
    GROUP BY p.playerid, p.playername
  ) t2 ON t1.player_teams = t2.player_teams AND t1.t1id <> t2.t2id
) innerQuery
GROUP BY player_teams
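
To make the intent of this query easier to follow, here is a rough Python equivalent: it groups players who play for exactly the same set of teams. The sample rows are invented, the trailing space added by `concat` is omitted, and every player is assumed to have at least one team.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for the player and plays tables.
players = {1: "Ann", 2: "Bob", 3: "Cho", 4: "Dee"}
plays = [(1, 10), (1, 20), (2, 20), (2, 10), (3, 10), (4, 30)]  # (playerid, teamid)

# Collect each player's teams (the array_agg ... ORDER BY step, sorted below).
teams_of = defaultdict(list)
for player_id, team_id in plays:
    teams_of[player_id].append(team_id)

# Group players by their exact team list (the self-join on player_teams).
groups = defaultdict(list)
for player_id, name in players.items():
    key = tuple(sorted(teams_of[player_id]))
    groups[key].append(f"{player_id}:{name}")

# Keep only team lists shared by at least two players (t1id <> t2id).
shared = {teams: names for teams, names in groups.items() if len(names) > 1}
print(shared)
```

With this sample data only Ann and Bob share the same teams (10 and 20), so they form the single group in the result.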

CAP THEOREM (BREWER’S THEOREM)

By Eric Brewer (University of California, Berkeley)

Any distributed system with multiple servers cannot have all 3 of the following properties at the same time:

• Consistency: every machine holds the same data at all times
• Availability: every data request from a client receives a response, whether or not it succeeds
• Partition tolerance: the system keeps working even when the servers cannot send data to each other
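
The trade-off can be illustrated with a toy model of two replicas. This is a deliberately simplified sketch, not how any particular database behaves: the class `Replica`, the function `write`, and the mode names "CP" and "AP" are all assumptions made for the example.

```python
class Replica:
    """One server holding a copy of a single value."""
    def __init__(self, value="v1"):
        self.value = value

def write(replicas, new_value, partitioned, mode):
    """Write through the first replica; the second is unreachable if partitioned."""
    primary, secondary = replicas
    if not partitioned:
        # Normal operation: both copies are updated, so C and A both hold.
        primary.value = secondary.value = new_value
        return "ok"
    if mode == "CP":
        # Keep the copies identical by not answering until the link is back.
        return "timeout"
    # mode == "AP": answer now; the unreachable copy keeps the old value.
    primary.value = new_value
    return "ok"

cp = [Replica(), Replica()]
ap = [Replica(), Replica()]
print(write(cp, "v2", partitioned=True, mode="CP"), [r.value for r in cp])
print(write(ap, "v2", partitioned=True, mode="AP"), [r.value for r in ap])
```

During the partition, the "CP" system keeps both copies equal but the client gets no answer, while the "AP" system answers immediately and the two copies disagree until they can synchronize again.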

CAP - NORMAL OPERATION – C+A

Source: http://www.julianbrowne.com/article/viewer/brewers-cap-theorem

CAP - NETWORK PARTITION - ONLY A IS ACHIEVED

Source: http://www.julianbrowne.com/article/viewer/brewers-cap-theorem

CAP THEOREM AND NOSQL

Source: http://blog.flux7.com/blogs/nosql/cap-theorem-why-does-it-matter

Source: http://db-engines.com/en/ranking

NOSQL EXAMPLE - MONGODB

Document-oriented NoSQL database

BSON store (binary-format JSON)

Databases - Collections - Documents

Supports multiple schemas at the same time: documents in the same collection can have different structures (fields)

Uses JavaScript as the main language for accessing data, with drivers for other languages such as Java and Python

Supports load balancing and replication

{
  "firstName": "John",
  "lastName": "Smith",
  "isAlive": true,
  "age": 25,
  "height_cm": 167.6,
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postalCode": "10021-3100"
  },
  "phoneNumbers": [
    { "type": "home", "number": "212 555-1234" },
    { "type": "office", "number": "646 555-4567" }
  ]
}
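
A short sketch of how a document like this one can be read and how two documents with different fields can sit side by side. It uses only Python's standard `json` module and a plain list as a stand-in for a collection, so it does not show the MongoDB driver API; the second document is invented for the example.

```python
import json

# The document from the slide, written without a trailing comma after the
# phoneNumbers array (strict JSON parsers reject trailing commas).
person = json.loads("""
{
  "firstName": "John",
  "lastName": "Smith",
  "isAlive": true,
  "age": 25,
  "height_cm": 167.6,
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postalCode": "10021-3100"
  },
  "phoneNumbers": [
    { "type": "home", "number": "212 555-1234" },
    { "type": "office", "number": "646 555-4567" }
  ]
}
""")

# Nested fields are reached by walking the structure.
city = person["address"]["city"]
office = next(p["number"] for p in person["phoneNumbers"] if p["type"] == "office")

# A plain list standing in for a collection: the second (hypothetical)
# document has a different set of fields, which a document store allows.
collection = [person, {"firstName": "Jane", "email": "jane@example.com"}]
fields = [sorted(doc.keys()) for doc in collection]

print(city, office)
print(fields[1])
```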

PREDICTIVE ANALYTICS

A tool for data scientists to analyze patterns in past data in order to predict the future

Techniques include statistics, modeling, machine learning, data mining, time series analysis, deep learning, text analytics, image processing, location analytics, etc.
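
As one very small instance of learning a pattern from past data to predict the future, the sketch below fits a straight line by least squares and extrapolates one step ahead. The monthly storage figures are invented for illustration, and `fit_line` is a hypothetical helper; real predictive analytics would use a proper library and validate the model.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical history: log storage used (TB) in months 1 to 5.
months = [1, 2, 3, 4, 5]
storage_tb = [2.0, 2.2, 2.4, 2.6, 2.8]

slope, intercept = fit_line(months, storage_tb)
forecast_month_6 = slope * 6 + intercept

print(f"Forecast for month 6: {forecast_month_6:.1f} TB")
```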

TYPES OF DATA ANALYTICS

BIG DATA ARCHITECTURE IN PRACTICE

[Diagram] Components shown:

Data Sources (four)

Data Ingestion

Fast Data Path: Data Stream Processors

Big Data Path: Data Lake (Landing Zone), Data Refinery / Data Analytics

Data Visualization

Traditional Data Warehouse / Reporting tools
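
A toy sketch of the two paths named on this slide. It assumes that ingestion delivers every event to both paths, which the slide does not state explicitly; the event fields and the names `ingest`, `data_lake`, and `live_counts` are invented for illustration.

```python
from collections import Counter

data_lake = []            # big data path: raw events land here unchanged
live_counts = Counter()   # fast data path: running totals updated per event

def ingest(event: dict) -> None:
    """Deliver one event to both paths (an assumption of this sketch)."""
    data_lake.append(event)               # kept for later batch analytics
    live_counts[event["source"]] += 1     # immediately visible to dashboards

# Hypothetical events from two data sources.
for event in [
    {"source": "wifi", "msg": "auth ok"},
    {"source": "lms", "msg": "login"},
    {"source": "wifi", "msg": "auth failed"},
]:
    ingest(event)

# Batch step over the lake, standing in for the refinery / analytics stage.
failed = [e for e in data_lake if "failed" in e["msg"]]

print(dict(live_counts), len(failed))
```

The fast path answers simple questions as events arrive, while the lake keeps the raw events so that heavier analysis can be run, or rerun, later.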

COMPARING THE BIG DATA ARCHITECTURE WITH A LOG SYSTEM (GRAYLOG)

[Diagram] Architecture components shown: Data Sources (four), Data Ingestion, Fast Data Path, Big Data Path, Data Stream Processors, Data Lake (Landing Zone), Data Refinery / Data Analytics, Data Visualization

Graylog components shown alongside: Kafka, ElasticSearch, Graylog-Web, Graylog-Event
