BASEL BERN BRUGG GENF LAUSANNE ZÜRICH DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. HAMBURG MÜNCHEN STUTTGART WIEN 2014 © Trivadis Big Data Infrastructure. Appliance, Cloud, or Do-it-Yourself. Daniel Steiger Discipline Manager Infrastructure Engineering DOAG Jahreskonferenz 2014 Big Data Infrastructure 1


Page 1: Big Data Infrastructure. - doag.org


Page 2: Big Data Infrastructure. - doag.org


Trivadis is a leader in IT consulting, system integration, solution engineering, and the delivery of IT services, with a focus on … technologies in the D-A-CH region. Our strategic business fields...

Our Company


Page 3: Big Data Infrastructure. - doag.org


With over 600 IT and domain experts on site with you

12 Trivadis branch offices with over 600 employees

200 service level agreements

More than 4'000 training participants

Research and development budget: CHF 5.0 million / EUR 4.0 million

Financially independent and sustainably profitable

Experience from more than 1'900 projects per year for over 800 customers

(As of 12/2013)


[Map of Trivadis locations: Hamburg, Düsseldorf, Frankfurt, Freiburg, München, Wien, Basel, Zürich, Bern, Lausanne, Stuttgart, Brugg]

Page 4: Big Data Infrastructure. - doag.org


Agenda

1.  Big Data Infrastructure Challenges

2.  Hadoop on an Appliance

3.  Hadoop in the Cloud

4.  Hadoop Do-it-Yourself

5.  Conclusion

Page 5: Big Data Infrastructure. - doag.org


Big Data Infrastructure Challenges

Page 6: Big Data Infrastructure. - doag.org


Trailwise – a "quantified self" use case

47'295 data points rendered in 643ms

11'000 data points rendered in 165ms

Page 7: Big Data Infrastructure. - doag.org


Trailwise – Infrastructure for a Proof of Concept


§  Hadoop HDFS as data store

§  HBase for real-time data access

§  Hadoop Map/Reduce
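The PoC stack above pairs HDFS batch storage with HBase for low-latency reads. What makes HBase suitable for rendering a time window quickly is row-key design; the sketch below uses hypothetical names and a hypothetical layout (the talk does not publish the actual Trailwise schema) to show a composite key that keeps one track's points sorted by time, so one render becomes a single contiguous scan.

```python
# Sketch of an HBase row-key design for time-series track points,
# as one might use for a "quantified self" store like Trailwise.
# Names and key layout are illustrative assumptions.
import struct

def track_point_rowkey(track_id: int, timestamp_ms: int) -> bytes:
    """Compose <track_id><timestamp> big-endian so all points of one
    track sort contiguously by time in HBase's byte-ordered keyspace."""
    return struct.pack(">IQ", track_id, timestamp_ms)

def scan_range(track_id: int, t_start_ms: int, t_stop_ms: int):
    """Start/stop keys for a single HBase scan covering one time window."""
    return (track_point_rowkey(track_id, t_start_ms),
            track_point_rowkey(track_id, t_stop_ms))

start, stop = scan_range(42, 1_400_000_000_000, 1_400_000_600_000)
assert start < stop  # big-endian packing preserves numeric sort order
```

Because the keys sort byte-wise, the region server can answer a "render these ten minutes" request with one sequential read rather than thousands of point lookups.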

Page 8: Big Data Infrastructure. - doag.org


Trailwise – Infrastructure Lessons Learned

For a proof of concept, Hadoop in the cloud (e.g. on Amazon EC2) is perfect...

+  Fast and easy deployment

+  Optimized Hadoop/HBase setup

+  HBase real-time performance

+  Map/Reduce scalability

+  Affordable, ca. EUR 15.-/day

Concerns…

§  Scalability

§  Costs for "always up"

§  Setup and administration of a large cluster on AWS

§  Break-even cloud vs on-premise

Page 9: Big Data Infrastructure. - doag.org


Big Data Infrastructure Challenges

§  Big Data means big data volume
   §  Petabytes and exabytes

§  Scalability
   §  10, 20, 50, 100, ... cluster nodes
   §  Costs should scale as well...

§  High demands on machine-to-machine networks
   §  In Big Data, for every one client interaction there may be hundreds or thousands of server and data node interactions
   §  This generates far more east-west (server-to-server or server-to-storage) network traffic than north-south (server-to-client or server-to-outside) network traffic

§  And many others like integration, data protection, operation, etc.
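The east-west fan-out described above can be put in rough numbers. This is a back-of-envelope sketch with assumed values (100 data nodes touched, 128 MB read per node, a 1 MB reply to the client), not measurements from the talk:

```python
# Why Hadoop clusters generate mostly east-west traffic: one client
# request fans out to many data nodes, and the inter-node reads dwarf
# the reply that leaves the cluster. Parameter values are assumptions.

def traffic_per_request(data_nodes: int, block_mb: float, reply_mb: float):
    """Return (east_west_mb, north_south_mb) for one client interaction
    that reads one block on each of `data_nodes` servers."""
    east_west = data_nodes * block_mb   # server <-> server/storage reads
    north_south = reply_mb              # aggregated answer to the client
    return east_west, north_south

ew, ns = traffic_per_request(data_nodes=100, block_mb=128, reply_mb=1)
print(f"east-west: {ew} MB, north-south: {ns} MB, ratio {ew / ns:.0f}:1")
```

Even with these modest assumptions the east-west volume is four orders of magnitude larger, which is why the network fabric, not the uplink, is the scaling concern.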

Page 10: Big Data Infrastructure. - doag.org


§  Infrastructure must be engineered to scale

§  The network has to provide high bandwidth, low latency, and should scale seamlessly with Hadoop clusters to provide predictable performance

§  And many more, like
   §  Integration with operational data systems
   §  Authentication, authorization, encryption
   §  Centralized management

Infrastructure Requirements

Excerpt from Barroso, Clidaras and Hölzle, "The Datacenter as a Computer" (Figure 1.2: Picture of a row of servers in a Google WSC, 2012):

1.6.1 Storage. Disk drives or Flash devices are connected directly to each individual server and managed by a global distributed file system (such as Google's GFS [58]), or they can be part of Network Attached Storage (NAS) devices directly connected to the cluster-level switching fabric. A NAS tends to be a simpler solution to deploy initially because it allows some of the data management responsibilities to be outsourced to a NAS appliance vendor. Keeping storage separate from computing nodes also makes it easier to enforce quality of service guarantees, since the NAS runs no compute jobs besides the storage server. In contrast, attaching disks directly to compute nodes can reduce hardware costs (the disks leverage the existing server enclosure) and improve networking fabric utilization (each server network port is effectively dynamically shared between the computing tasks and the file system).

The replication model between these two approaches is also fundamentally different. A NAS tends to provide high availability through replication or error correction capabilities within each appliance, whereas systems like GFS implement replication across different machines and consequently will use more networking bandwidth to complete write operations. However, GFS-like systems are able to keep data available even after the loss of an entire server enclosure or rack and may allow higher aggregate read bandwidth because the same data can be sourced from multiple …

Will my infrastructure meet my needs now and in the future without putting my business at risk?

Page 11: Big Data Infrastructure. - doag.org


Where to Deploy your Hadoop Cluster?

When enterprises adopt Hadoop, one of the decisions they must make is the deployment model. There are four options as illustrated in Figure 1:

§  On-premise full custom. With this option, businesses purchase commodity hardware, then they install software and operate it themselves. This option gives businesses full control of the Hadoop cluster.

§  Hadoop appliance. This preconfigured Hadoop cluster allows businesses to bypass detailed technical configuration decisions and jumpstart data analysis.

§  Hadoop hosting. Much as with a traditional ISP model, organizations rely on a service provider to deploy and operate Hadoop clusters on their behalf.

§  Hadoop-as-a-Service. This option gives businesses instant access to Hadoop clusters with a pay-per-use consumption model, providing greater business agility.

To determine which of these options presents the right deployment model, organizations must consider five key areas. The first is the price-performance ratio, and it is the focus of this paper. The Hadoop-as-a-service model is typically cloud-based and uses virtualization technology to automate deployment and operation processes (in comparison, the other models typically use physical machines directly).

There have existed two divergent views related to the price-performance ratio for Hadoop deployments. One view is that a virtualized Hadoop cluster is slower because Hadoop’s workload has intensive I/O operations, which tend to run slowly on virtualized environments. The other view is that the cloud-based model provides compelling cost savings because its individual server node tends to be less expensive; furthermore, Hadoop is horizontally scalable.

The second area of consideration is data privacy, which is a common concern when storing data outside of corporate-owned infrastructure. Cloud-based deployment requires a comprehensive cloud-data privacy strategy that encompasses areas such as proper implementation of legal requirements, well-orchestrated data-protection technologies, as well as the organization’s culture with regard to adopting emerging technologies. Accenture Cloud Data Privacy Framework outlines a detailed approach to help clients address this issue.

The third area is data gravity. Once data volume reaches a certain point, physical data migration becomes prohibitively slow, which means that many organizations are locked into their current data platform. Therefore, the portability of data, the anticipated future growth of data, and the location of data must all be carefully considered.

A related and fourth area is data enrichment, which involves leveraging multiple datasets to uncover new insights. For example, combining a consumer’s purchase history and social-networking activities can yield a deeper understanding of the consumer’s lifestyle and key personal events and therefore enable companies to introduce new services and products of interest. The primary challenge is that the storage of these multiple datasets increases the volume of data, resulting in slow connectivity. Therefore, many organizations choose to co-locate these datasets. Given volume and portability considerations, most organizations choose to move the smaller datasets to the location of the larger ones. Thus, thinking strategically about where to house your data, considering both current and future needs, is key.

The fifth area is the productivity of developers and data scientists. They tap into the datasets, create a “sandbox” environment, explore the data analysis ideas, and deploy them into production. Cloud’s self-service deployment model tends to expedite this process.

Figure 1. The spectrum of Hadoop deployment options, from bare-metal to cloud: On-premise full custom → Hadoop appliance → Hadoop hosting → Hadoop-as-a-Service

Reference: Hadoop Deployment Comparison Study, Price-Performance Comparison, Accenture Technology Labs, 2013

Page 12: Big Data Infrastructure. - doag.org


Hadoop on an Appliance Oracle Big Data Appliance

Page 13: Big Data Infrastructure. - doag.org


Overview: Oracle's Big Data Solution

§  A complete and optimized solution for big data

§  Tight integration with Exadata, Exalogic, Exalytics and SPARC Supercluster using Infiniband network

§  Single-vendor support for both hardware and software

Page 14: Big Data Infrastructure. - doag.org


Full Rack Configuration (up to 18 racks)

§  18 x compute/storage nodes

Per Node:

§  2 x Eight-Core Intel ® Xeon ® E5-2650 V2 Processors

§  64 GB Memory (up to 512 GB)

§  48 TB Raw Storage Capacity

§  40 Gb/sec Infiniband Network

§  10 Gb/sec Data Center Connectivity


Oracle Big Data Appliance X4-2 HW

Source: Oracle ®

Page 15: Big Data Infrastructure. - doag.org


Oracle Big Data Appliance Internal Network Connectivity

Source: Oracle Big Data Appliance: Datacenter Network Integration, Oracle White Paper, 2012

Page 16: Big Data Infrastructure. - doag.org


Big Data Appliance Software Stack

§  Oracle Linux 6.4 with UEK

§  Oracle Java JDK 7

§  Cloudera Enterprise Data Hub Edition
   §  Apache Hadoop HDFS
   §  HBase
   §  Cloudera Impala
   §  Cloudera Search
   §  Cloudera Manager
   §  Apache Spark

§  Oracle R Distribution

§  Oracle NoSQL DB Community Ed.

§  BDA Enterprise Manager Plug-In

§  Optional Software*
   §  Oracle Big Data SQL
   §  Oracle Big Data Connectors
   §  Oracle Audit Vault & Database Firewall for Hadoop Auditing
   §  Oracle Data Integrator
   §  Oracle NoSQL Database EE

*Connectors are licensed separately from Oracle Big Data Appliance

Page 17: Big Data Infrastructure. - doag.org


BDA Specific Software Features

§  Oracle R Support for Big Data
   §  R is an open-source language and environment for statistical analysis and graphing
   §  The standard R distribution is installed on all nodes of Oracle Big Data Appliance
   §  Oracle R Connector for Hadoop provides R users with high-performance, native access to HDFS and the MapReduce programming framework
   §  Oracle R Enterprise is a separate package that provides real-time access to Oracle Database

§  Oracle NoSQL Database
   §  Oracle NoSQL Database is a distributed key-value database built on the storage technology of Berkeley DB Java Edition
   §  An intelligent driver on top of Berkeley DB keeps track of the underlying storage topology, shards the data, and knows where data can be placed with the lowest latency
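How such a driver finds the right node without a central lookup can be sketched with client-side hashing. This is a toy illustration only: Oracle NoSQL Database's real driver tracks the live storage topology and partitions on structured key paths, not the flat MD5 scheme assumed here.

```python
# Minimal sketch of client-side sharding: hashing the key on the
# client decides which shard (replication group) holds a record, so
# every request goes straight to the owning node. Shard count and
# hash choice are illustrative assumptions.
import hashlib

N_SHARDS = 3

def shard_for(key: str) -> int:
    """Map a key deterministically onto one of N_SHARDS shards."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % N_SHARDS

# The same key always resolves to the same shard, on every client:
assert shard_for("user/42/profile") == shard_for("user/42/profile")
```

Because the mapping is computed locally, reads and writes avoid an extra network hop to a directory service, which is where the "lowest latency" claim on the slide comes from.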

Page 18: Big Data Infrastructure. - doag.org


§  Oracle SQL Connector for HDFS

§  Oracle Loader for Hadoop

§  Oracle R Connector for Hadoop

§  Oracle Data Integrator Application Adapter for Hadoop

§  Data in HDFS (and NoSQL) is accessible through the relational database external table mechanism (HDFS as cluster file system)

*The connectors are licensed separately from Oracle Big Data Appliance


Oracle Big Data Connectors

Source: Oracle ®. Reference: Oracle Big Data Connectors Data Sheet

Page 19: Big Data Infrastructure. - doag.org


Oracle Big Data SQL: one tool for all data sources

Reference: https://www.oracle.com/webfolder/s/delivery_production/docs/FY15h1/doc6/1-T2-BigData.pdf

Page 20: Big Data Infrastructure. - doag.org


§  Oracle Big Data Lite VM
   §  http://www.oracle.com/technetwork/database/bigdata-appliance/oracle-bigdatalite-2104726.html

§  MOS Notes
   §  Information Center: Oracle Big Data Appliance (Doc ID 1445762.2)
   §  Big Data Connectors (Doc ID 1487399.2)
   §  Sqoop Frequently Asked Questions (FAQ) (Doc ID 1510470.1)


Oracle Big Data Appliance Resources

Page 21: Big Data Infrastructure. - doag.org


Hadoop in the Cloud

Page 22: Big Data Infrastructure. - doag.org


Hadoop in the Cloud

Page 23: Big Data Infrastructure. - doag.org


There are five key areas to consider when choosing the right deployment model*:

*Public Cloud, Private Cloud, Community Cloud or Hybrid Cloud


Deployment Considerations


Out of these five key areas, Accenture assessed the price-performance ratio between bare-metal Hadoop clusters and Hadoop-as-a-Service on Amazon Web Services™. (A bare-metal Hadoop cluster refers to a Hadoop cluster deployed on top of physical servers without a virtualization layer. Currently, it is the most common Hadoop deployment option in production environments.)

For the experiment, we first built the total cost of ownership (TCO) model to control two environments at the matched cost level. Then, using the Accenture Data Platform Benchmark as real-world workloads, we compared the performance of both a bare-metal Hadoop cluster and Amazon Elastic MapReduce (Amazon EMR™). Employing these empirical and systemic analyses, Accenture's study revealed that Hadoop-as-a-Service offers a better price-performance ratio. Thus, this result debunks the idea that the cloud is not suitable for Hadoop MapReduce workloads, with their heavy I/O requirements. Moreover, the benefit of performance tuning is so huge that the cloud's virtualization layer overhead is a worthy investment, as it expands performance tuning opportunities. Lastly, despite the sizable benefit, the performance tuning process is complex and time-consuming, and thus requires automated tuning tools. The results are explored in detail in our full study, "Hadoop Deployment Comparison Study".

Five key areas to consider when choosing the right deployment model:

§  Price-performance ratio
§  Data privacy
§  Data gravity
§  Data enrichment
§  Productivity of developers and data scientists

Reference: Where to Deploy your Hadoop Cluster?, Executive Summary, Accenture Technology Labs, 2013

Page 24: Big Data Infrastructure. - doag.org


EC2 Instance for Hadoop/MapReduce

Storage optimized – current generation

§  Instance "hs1.8xlarge"
   §  16 vCPUs (Intel Xeon)
   §  117 GB RAM
   §  24 x 2000 GB = 48 TB
   §  10 Gigabit network

§  MapR as option
   §  M3, M5 or M7 edition


Amazon EMR with the MapR Distribution for Hadoop

Reference: http://aws.amazon.com/elasticmapreduce/mapr/

Page 25: Big Data Infrastructure. - doag.org


Costs for "hs1.8xlarge" Instance

§  Medium Utilization Reserved Instances
   §  1-Year term: upfront $9'200, $1.809 per hour
   §  3-Year term: upfront $14'109, $1.581 per hour

§  Data Transfer IN to Amazon EC2 from internet: $0.0 per GB

§  Data Transfer OUT from Amazon EC2 to internet: $0.12 per GB up to 10TB/month ($120 per TB)

§  MapR M7: $1.49 per Hour

§  Total: $2'600/month, $31'200/year (24/365 utilization)
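These line items can be cross-checked. Assuming the 3-year reserved term with its upfront fee amortized monthly and around-the-clock utilization, the listed rates land within a few percent of the slide's $2'600/month and $31'200/year totals:

```python
# Rough reproduction of the slide's monthly cost figure from the
# listed prices. Assumptions: 3-year reserved term, upfront cost
# spread over 36 months, 24/365 utilization (~730 h/month).
HOURS_PER_MONTH = 24 * 365 / 12           # ≈ 730

upfront_3y = 14_109                        # $ upfront, 3-year term
hourly_ec2 = 1.581                         # $/h, hs1.8xlarge reserved
hourly_mapr = 1.49                         # $/h, MapR M7

monthly = upfront_3y / 36 + (hourly_ec2 + hourly_mapr) * HOURS_PER_MONTH
print(f"≈ ${monthly:,.0f}/month, ${12 * monthly:,.0f}/year")
```

The result comes out near $2'634/month; data-transfer-out charges ($0.12 per GB) would come on top and depend on usage, which may explain the slide rounding to $2'600.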


Amazon EMR with the MapR Distribution for Hadoop

Page 26: Big Data Infrastructure. - doag.org


Hadoop on Do-It-Yourself Infrastructure

Page 27: Big Data Infrastructure. - doag.org


Do-it-Yourself (experimental setup)

Source: http://blog.ittoby.com/

Page 28: Big Data Infrastructure. - doag.org


HP ProLiant DL380p Gen8

§  2 x Eight-Core Intel ® Xeon ® E5-2650 V2

§  64 GB Memory (up to 512 GB)

§  48 TB Raw Storage Capacity

§  40 Gb/sec Infiniband Network

§  10 Gb/sec Data Center Connectivity

§  About $20'000 + Rack + Network + Work


Do-it-Yourself (enterprise class setup)

Technical white paper | HP Reference Architecture for MapR M5

This section specifies which server to use and the rationale behind it. The Reference Architectures section will provide topologies for the deployment of control and worker services across the nodes for clusters of varying sizes.

Processor configuration. MapR manages the amount of work each server is able to undertake via the amount of Map/Reduce slots configured for that server. The more cores available to the server, the more Map/Reduce slots can be configured for the server (see the Computation section for more detail). We recommend 6 core processors for a good balance of price and performance. We recommend that Hyper-Threading is turned on.

Drive configuration. Redundancy is built into the MapR architecture and thus there is no need for RAID or additional hardware components to improve redundancy on the server, as it is all coordinated and managed in the MapR software. Drives should use a Just a Bunch of Disks (JBOD) configuration, which can be achieved with the HP P420 RAID controller by configuring each individual disk as a separate RAID 0 volume. We recommend disabling array acceleration on the controller to better handle large block I/Os in the Hadoop environment. Lastly, servers should provide a large amount of storage capacity, which increases the total capacity of the distributed file system; provide that capacity by using at least twelve 2TB Large Form Factor drives for optimum I/O performance. The DL380e supports 14 Large Form Factor (LFF) drives, which allows one to either use all 14 drives for data or use 12 drives for data and the additional 2 for mirroring the operating system and MapR runtime. Hot-pluggable drives are recommended so that drives can be replaced without restarting the server.

Memory configuration. Servers running the node processes should have sufficient memory for either HBase or for the amount of Map/Reduce slots configured on the server. A server with a larger RAM configuration will deliver optimum performance for both HBase and Map/Reduce. To ensure optimal memory performance and bandwidth, we recommend using 8GB or 16GB DIMMs to populate each of the 6 memory channels as needed.

Network configuration. The DL380e includes four 1GbE NICs onboard. MapR automatically identifies the available NICs on the server and bonds them via the MapR software to increase throughput. Each of the reference architecture configurations below specifies an additional Top of Rack switch for redundancy. To best make use of this, we recommend cabling the ProLiant DL380e worker nodes so that NIC 1 is cabled to Switch 1 and NIC 2 is cabled to Switch 2, repeating the same process for NICs 3 and 4. Each NIC in the server should have its own IP subnet instead of sharing the same subnet with other NICs.

HP ProLiant DL380e Gen8. The HP ProLiant DL380e Gen8 (2U) is an excellent choice as the server platform for the worker nodes. (Figure 6. HP ProLiant DL380e Gen8 Server)

§  Cloudera Enterprise Data Hub Edition 5.x

§  ca. $2'500/node + support
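One way to read the storage figures: with the HP paper's recommendation of twelve 2 TB data drives per node and a replication factor of 3 (the HDFS default, assumed here since the slides do not show the calculation), 24 TB of raw disk shrinks to about 8 TB of usable capacity per node:

```python
# Sanity check on usable per-node capacity for a DIY worker node,
# assuming twelve 2 TB data drives (per the HP reference architecture)
# and 3-way replication as in stock HDFS.
drives_per_node = 12
drive_tb = 2
replication = 3        # HDFS default replication factor

raw_tb = drives_per_node * drive_tb
usable_tb = raw_tb / replication
print(f"raw: {raw_tb} TB, usable (r={replication}): {usable_tb:.0f} TB")
```

This matches the 8 TB listed for the DIY column in the closing cost-comparison table, and it is a reminder that "raw storage capacity" on any vendor spec sheet is roughly 3x the usable HDFS capacity.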

Page 29: Big Data Infrastructure. - doag.org


Conclusion

Page 30: Big Data Infrastructure. - doag.org


Appliance, Cloud or DIY?

Oracle BDA

+ High performance scalable network architecture

+ Highly integrated into the Oracle ecosystem

+ Complete software stack Oracle & Hadoop

+ Single point of support

+ Competitive price/performance ratio for enterprise-class demands

Amazon EC2 Instances

+ Fast and easy deployment

+ Scales from very small to very large cluster setups

+ Capacity on demand on an hourly basis

+ Optional enterprise-class Hadoop distribution

+ Interesting price model for volatile utilisation and capacity on demand

Do it Yourself

+ Low entry point

+ Free choice of hardware

+ Free choice of software stack


Page 31: Big Data Infrastructure. - doag.org


§  Building an enterprise-class Hadoop infrastructure is a challenge

§  Analysing and prioritizing your requirements (business and IT) is crucial

§  Start "small & fast" with a proof of concept

§  Consider various deployment models (On-Premise, Appliance, IaaS, PaaS, HaaS, ...)

§  The Oracle Big Data Appliance is a very competitive offering, especially as an extension to your existing Oracle operational data systems


Page 32: Big Data Infrastructure. - doag.org


Thank you. Daniel Steiger Discipline Manager Infrastructure Engineering

Tel: +41 58 459 50 88 [email protected]


Page 33: Big Data Infrastructure. - doag.org


Trivadis at the DOAG conference: Level 3, right next to the escalator

We look forward to your visit. Because with Trivadis you always win.

Page 34: Big Data Infrastructure. - doag.org


Cost comparison

Attribute            Oracle BDA     Amazon EMR     DIY
Type                 X4-2           hs1.8xlarge    DL-380
CPU                  2x8-Core       16 vCPU        2x8-Core
RAM                  64 GB          117 GB         64 GB
Storage              48 TB          48 TB          8 TB
Network              10 Gb / 40 Gb  10 Gb          10 Gb / 40 Gb
Hadoop Distr.        Cloudera       MapR           Cloudera
Price / year         525'000        562'256        405'000
Maintenance / year   63'000         -              40'000
Total year 1         588'000        562'256        445'000
Total 3 years        714'000        1'686'768      525'000
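The totals are internally consistent if the BDA and DIY "price" is read as a one-time purchase with only maintenance recurring, while Amazon EMR is billed every year (an interpretation inferred from the numbers, not stated on the slide); a quick check:

```python
# Verify the cost-comparison totals under the reading that BDA and
# DIY prices are one-time purchases plus yearly maintenance, while
# Amazon EMR is a recurring yearly cost.
def tco(price, maintenance_per_year, recurring, years):
    """Total cost of ownership after `years`; `recurring` marks a
    pay-per-year price instead of a one-time purchase."""
    base = price * years if recurring else price
    return base + maintenance_per_year * years

assert tco(525_000, 63_000, False, 1) == 588_000      # Oracle BDA, year 1
assert tco(525_000, 63_000, False, 3) == 714_000      # Oracle BDA, 3 years
assert tco(562_256, 0, True, 3) == 1_686_768          # Amazon EMR, 3 years
assert tco(405_000, 40_000, False, 3) == 525_000      # DIY, 3 years
```

Under this model the cloud option wins in year one but costs roughly three times the DIY cluster over three years of constant 24/365 utilization, which is the talk's closing argument for matching the deployment model to the utilization profile.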