Track B-3: Deconstructing Big Data Architecture - Server and Network Resource Planning for Big Data Systems

Deconstructing Big Data Architecture: Server and Network Resource Planning for Big Data Systems. “How to eat an elephant – one byte at a time.” CP Li 李俊邦, Enterprise Technologist, Enterprise Solutions & Alliances, Greater China, Dell

Upload: etu-solution

Post on 06-Aug-2015


TRANSCRIPT

Page 1

Deconstructing Big Data Architecture: Server and Network Resource Planning for Big Data Systems

“How to eat an elephant – one byte at a time”

CP Li 李俊邦

Enterprise Technologist

Enterprise Solutions & Alliances, Greater China

Dell

Page 2

Agenda

1.  Server roles
    1.  Manager
    2.  Name Nodes
    3.  Edge Nodes
    4.  Data Nodes

2.  Hadoop cluster design

3.  Etu+Dell

4.  Futures / Roadmap

5.  Questions?

Page 3

Server Roles - Manager

•  Graphical installation interface and management console for the system

•  Usually installed on an Edge Node

•  Common choices
   –  Cloudera Manager
   –  Apache Ambari

Page 4

Server Roles – Name Nodes

•  Store the HDFS metadata

•  Job Manager for the YARN data-processing framework

•  Primary
   –  Receives heartbeats from the data nodes
   –  Every 10th heartbeat is a block report, from which the Name Node generates its metadata

•  Standby
   –  Checks in every hour to mirror the metadata / block map
   –  Not a hot spare; requires manual failover

•  High Availability (HA) can be added in some distributions
   –  Results in a dedicated HA node that acts as a witness to the Name Node cluster

Page 5

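Server Roles - Edge Nodes

•  Main gateway for data entering and leaving the Hadoop cluster

•  Scalable

•  The only multi-homed nodes (attached to more than one network segment) in the Hadoop cluster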

[Diagram: PowerEdge R730 servers in the Name Node, Standby Name Node, Edge Node(s), and HA Node roles, plus PowerEdge R730XD Data Nodes, attached to the Data Network; a Corporate Network connection is also shown.]

Page 6

Server Roles - Data Node

•  Primary storage location for HDFS

•  Runs the data processing assigned by YARN resource management

•  Key attributes
   –  Memory
      ›  64GB as the standard configuration
      ›  Additional services (Impala/Spark) need more memory
   –  Many local disks (JBOD / non-RAID mode)
      ›  SFF (2.5”) for performance-based workloads
      ›  LFF (3.5”) for capacity-centric workloads
   –  CPUs: legacy recommendation of a 1:1 core:spindle ratio
      ›  SSDs, faster HDDs (10K+ RPM), and in-memory workloads make this less of an issue
      ›  10-core and 12-core CPUs are the best-practice default

Page 7

Hadoop Cluster Design

Page 8

Hadoop Cluster Design – Hardware Considerations

Page 9

Hadoop Cluster Deployment – Installation Best Practices

•  Use pre-built, assembled & cabled racks from vendor

•  Use automated deployment tools (e.g., Open Crowbar)

•  Purchase nodes in standard size groups for easy capacity growth and ordering, not in single node increments

–  Common increments are ½ or full rack for easy deployment and sizing

•  For each type of hardware, purchase spare components to keep on site for easy, rapid repair
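The advice to buy in standard increments can be sketched as a small rounding helper. This is an illustrative sketch only: the function name and the figure of 20 nodes per full rack are assumptions made for the example, not numbers from the slides.

```python
import math

# Hypothetical value for illustration; real rack density depends on the
# server form factor, switches, and power budget.
NODES_PER_FULL_RACK = 20


def round_up_to_increment(nodes_needed: int, increment: int) -> int:
    """Round a node count up to the next whole purchasing increment."""
    if nodes_needed < 0 or increment <= 0:
        raise ValueError("nodes_needed must be >= 0 and increment must be > 0")
    return math.ceil(nodes_needed / increment) * increment


half_rack = NODES_PER_FULL_RACK // 2

# A sizing exercise that calls for 23 nodes becomes 30 in half-rack steps
# and 40 in full-rack steps.
print(round_up_to_increment(23, half_rack))            # 30
print(round_up_to_increment(23, NODES_PER_FULL_RACK))  # 40
```

Rounding up rather than ordering the exact count keeps every purchase identical to the previous one, which is the point of the half-rack or full-rack increment.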

Page 10

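Core Hadoop Use Cases

•  Archive
   –  High disk-to-CPU ratio; low memory usage
   –  Regulatory requirements
   –  Long-term archiving

•  Data processing
   –  High disk-to-CPU ratio; medium memory usage
   –  DW offload
   –  ETL offload
   –  EDH quality analysis
   –  IT log analysis

•  Analytics
   –  High core count; high memory usage
   –  Market analysis
   –  Fraud prevention
   –  Network analysis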

Page 11

Common Hadoop Use Case to Ecosystem Tool Mapping

Page 12

Hadoop Use Case to Ratio Mapping

CPU (cores) : Memory (GB) : Disk (count), per Data Node

•  Archive: 1:2:1

•  Data processing: 1:4:1

•  Analytics: 2:8:1
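As a rough illustration, the ratios can be turned into a per-node sizing helper. The sketch below reads each ratio as "cores and GB of memory per disk", which is an interpretation of the slide rather than something it states, and the 12-disk node is a hypothetical example. The names `USE_CASE_RATIOS`, `DataNodeSpec`, and `size_data_node` are invented for this sketch.

```python
from dataclasses import dataclass

# CPU cores : memory (GB) : disks, as listed on the slide.
USE_CASE_RATIOS = {
    "archive": (1, 2, 1),
    "data_processing": (1, 4, 1),
    "analytics": (2, 8, 1),
}


@dataclass(frozen=True)
class DataNodeSpec:
    cores: int
    memory_gb: int
    disks: int


def size_data_node(use_case: str, disks: int) -> DataNodeSpec:
    """Scale the slide's ratio by the number of disks in one data node."""
    if disks <= 0:
        raise ValueError("disks must be positive")
    cores, memory_gb, disk_unit = USE_CASE_RATIOS[use_case]
    return DataNodeSpec(
        cores=cores * disks // disk_unit,
        memory_gb=memory_gb * disks // disk_unit,
        disks=disks,
    )


# Hypothetical 12-disk data node under each use case.
for use_case in USE_CASE_RATIOS:
    print(use_case, size_data_node(use_case, disks=12))
```

Under this reading, a 12-disk node works out to 12 cores and 24 GB for archive, 12 cores and 48 GB for data processing, and 24 cores and 96 GB for analytics. These are starting points for comparison, not vendor sizing guidance.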

Page 13

Node Considerations

[Slide image not transcribed; labels shown: Dell PowerEdge R730 (three times), Dell PowerEdge R730xd]

Page 14

Node Considerations

Page 15

HDFS Capacity

•  HDFS protects data by replicating it across nodes. The default Replication Factor is 3, and it is configurable.

•  HDFS Raw Capacity = Number of Compute Nodes x Number of Drives per Node x Capacity per Drive

•  HDFS Usable Capacity = HDFS Raw Capacity / Replication Factor
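The two capacity formulas can be written out directly. This is a minimal sketch: the function names and the example cluster (10 nodes, 12 drives of 4 TB each) are assumptions for illustration, not figures from the slides.

```python
def hdfs_raw_capacity_tb(nodes: int, drives_per_node: int, drive_tb: float) -> float:
    """Raw capacity = nodes x drives per node x capacity per drive."""
    return nodes * drives_per_node * drive_tb


def hdfs_usable_capacity_tb(raw_tb: float, replication_factor: int = 3) -> float:
    """Usable capacity = raw capacity / replication factor (default 3)."""
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    return raw_tb / replication_factor


# Hypothetical cluster: 10 data nodes, 12 drives each, 4 TB per drive.
raw = hdfs_raw_capacity_tb(nodes=10, drives_per_node=12, drive_tb=4.0)
usable = hdfs_usable_capacity_tb(raw)
print(raw, usable)  # 480.0 160.0
```

The result is best read as an upper bound, since the formula does not subtract space used for anything other than replicated HDFS blocks.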

Page 16

Big Data Networking Best Practices

•  Traditional Ethernet is used since it’s affordable and already prevalent.

•  Early drafts of the solution used 1GbE networking, but as the cost of 10GbE has come down, 10GbE is the much more efficient choice.

•  Multiple ports are teamed for both redundancy and throughput. LACP and software bonding are the most common methods.

•  IPv4 is most widely used. IPv6 has limited support at the OS and Hadoop level.
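To put the 1GbE versus 10GbE comparison in numbers, the sketch below computes idealized transfer times at line rate. It ignores protocol overhead and contention, so real transfers take longer; the function name and the 1 TB payload are assumptions chosen for the example.

```python
def transfer_seconds(data_bytes: int, link_gbps: float, links: int = 1) -> float:
    """Ideal time to move data_bytes over `links` teamed links at line rate."""
    if link_gbps <= 0 or links < 1:
        raise ValueError("link_gbps must be > 0 and links must be >= 1")
    bits = data_bytes * 8
    return bits / (link_gbps * 1e9 * links)


ONE_TB = 10**12  # decimal terabyte, in bytes

t_1gbe = transfer_seconds(ONE_TB, link_gbps=1)
t_10gbe = transfer_seconds(ONE_TB, link_gbps=10)
t_2x10gbe = transfer_seconds(ONE_TB, link_gbps=10, links=2)

print(t_1gbe, t_10gbe, t_2x10gbe)  # 8000.0 800.0 400.0
```

The teamed figure is an aggregate across many flows. With LACP, traffic is distributed per flow, so a single connection is generally limited to the speed of one member link.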

Page 17

Attributes of a Good Switch for Big Data

•  Non-blocking backplane

•  Deep per-port packet buffers (shared buffers do not work well). During the sort/shuffle phases of MapReduce operations, network traffic is so chaotic that it can saturate any and all shared buffers, degrading network performance for multiple hosts.

•  Good choices:
   –  1GbE
      ›  S55
      ›  S60
   –  10GbE
      ›  S4810
      ›  S5000
   –  40GbE
      ›  Z9000
      ›  Z9500
      ›  S6000

Page 18

Dell Hadoop Solution Logical Diagram

Page 19

Scale-out Aggregation Layer

Page 20

Dell Points of Integration

•  VLT / VRRP is a very affordable way to team switches both at the ToR and the aggregation tiers. This makes the Dell Networking Force10 switches a great choice.

•  Active Fabric Manager
   –  Speeds up the creation and administration of the required VLT / VRRP configuration on the switches.
   –  Helps with capacity planning as customers scale.

Page 21

Big Data Networking Futures

•  40GbE onboard LOMs will begin to be used for high-volume clusters. For now, the cost-benefit ratio isn’t there yet.

•  As HPC and Big Data converge, we’ll start to see the use of InfiniBand (IB) for node-to-node connectivity.

•  In-memory workloads (Spark / Impala) are reducing the bottlenecks that used to exist at the disk; the bottlenecks are now moving to the processor and network. Expect customers to look to increase core counts and network speed to overcome this.

Page 22

@Dell_Enterprise Enterprise Solutions

Etu+Dell = complete Hadoop/Big Data solution provider

Best of breed Cloudera partners

- Etu

Analytic software solutions for Big Data

Dell Professional Services for Big Data

Dell PowerEdge 13G servers

Dell Networking solutions

Installation and configuration service Complete end-to-end implementation

Discover Plan Implement Investigate

Page 23

Solution architecture

[Diagram: four stages (1. Integrate, 2. Store, 3. Analyze, 4. Act). Labels shown: Analytical output; Toad Data Point Desktop (integrate, cleanse); Dell Boomi Cloud (integrate, correlate); Toad Intelligence Central; Data aggregation and virtualization; Dell STATISTICA; Customer data; Order data; Events; Stock market data; Advanced Analytics; Marketing campaigns; Dell Statistica Big Data Desktop (crawl, save); Social Media.]

Page 24

Futures

•  Speed improvements in MapReduce

•  More in-memory workloads
   –  Possible move to Spark to replace MapReduce

•  Virtualized Hadoop
   –  VMware Big Data Extensions
   –  OpenStack Sahara
   –  Microsoft HDInsight (Hortonworks)

Page 25

Dell In-Memory Appliance for Cloudera Enterprise: Configurations at a glance

Mid-Size Configuration

•  16 Node Cluster
•  PowerEdge R720: 4 Infrastructure Nodes with ProSupport
•  PowerEdge R720XD: 12 Data Nodes with ProSupport
•  Cloudera Enterprise
•  Force10 S4810P
•  Force10 S55
•  Dell Rack 42U
•  ~528TB (disk raw space)

Starter Configuration

•  8 Node Cluster
•  PowerEdge R720: 4 Infrastructure Nodes with ProSupport
•  PowerEdge R720XD: 4 Data Nodes with ProSupport
•  Cloudera Enterprise
•  Force10 S4810P
•  Force10 S55
•  Dell Rack 42U
•  ~176TB (disk raw space)

Small Enterprise Configuration

•  24 Node Cluster
•  PowerEdge R720: 4 Infrastructure Nodes with ProSupport
•  PowerEdge R720XD: 20 Data Nodes with ProSupport
•  Cloudera Enterprise
•  Force10 S4810P
•  Force10 S55
•  Dell Rack 42U
•  ~880TB (disk raw space)

Expansion Unit: PowerEdge R720XD, 4 Data Nodes with ProSupport, Cloudera Enterprise; scales in blocks
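The raw-capacity figures for the three configurations can be cross-checked with simple arithmetic. The sketch below uses the data-node counts and raw terabytes from the slide, and applies the default replication factor of 3 from the HDFS Capacity slide to estimate usable space. The function and variable names are illustrative.

```python
# (data nodes, approximate raw TB) as listed on the slide.
CONFIGS = {
    "Starter": (4, 176),
    "Mid-Size": (12, 528),
    "Small Enterprise": (20, 880),
}


def raw_tb_per_data_node(data_nodes: int, raw_tb: float) -> float:
    """Average raw disk space contributed by each data node."""
    return raw_tb / data_nodes


def usable_tb(raw_tb: float, replication_factor: int = 3) -> float:
    """Usable HDFS capacity = raw capacity / replication factor."""
    return raw_tb / replication_factor


per_node = {
    name: raw_tb_per_data_node(nodes, raw) for name, (nodes, raw) in CONFIGS.items()
}

for name, (nodes, raw) in CONFIGS.items():
    print(
        f"{name}: {per_node[name]:.0f} TB raw per data node, "
        f"about {usable_tb(raw):.0f} TB usable at 3x replication"
    )
```

All three configurations work out to about 44 TB of raw disk per data node. If an expansion unit uses the same data nodes, each 4-node block would add roughly 176 TB of raw space.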

Page 26