Best Practices for Virtualizing Apache Hadoop


DESCRIPTION

Join this webinar to discuss best practices for designing and building a solid, robust and flexible Hadoop platform on an enterprise virtual infrastructure. Attendees will learn the flexibility and operational advantages of virtual machines, such as fast provisioning, cloning, high levels of standardization, hybrid storage, vMotion migration, increased stabilization of the entire software stack, High Availability and Fault Tolerance. This is a can't-miss presentation for anyone wanting to understand the design, configuration and deployment of Hadoop in virtual infrastructures.

TRANSCRIPT

Page 1: Best Practices for Virtualizing Apache Hadoop

© Hortonworks Inc. 2013

Best Practices Virtualizing Hadoop

George Trujillo

Page 2: Best Practices for Virtualizing Apache Hadoop

George Trujillo

§ Master Principal Big Data Specialist – Hortonworks
§ Tier One Big Data/BCA Specialist – VMware Center of Excellence
§ VMware Certified Instructor (VMware Certified Professional)
§ MySQL Certified DBA
§ Sun Microsystems Ambassador for Java Platforms
§ Author of Linux Administration and Advanced Linux Administration video training
§ Recognized as an Oracle Double ACE by Oracle Corporation
§ Served on the Oracle Fusion Council, the Oracle Beta Leadership Council and the Independent Oracle Users Group (IOUG) Board of Directors; recognized as one of the "Oracles of Oracle" by IOUG

Page 3: Best Practices for Virtualizing Apache Hadoop

Agenda

• Hypervisors today
• Building an enterprise virtual platform
• Virtualizing master and slave servers
• Best practices
• Deploying Hadoop in public and private clouds

Page 4: Best Practices for Virtualizing Apache Hadoop

Hypervisors Today: Faster, Less Overhead

• VMware vSphere, Microsoft Hyper-V Server, Citrix XenServer and Red Hat RHEV

Hypervisor performance benchmarks (% overhead):
– VMware: 1M IOPS with 1 microsecond of latency (vSphere 5.1), 2–10% overhead
– KVM: 1M transactions/minute (RHEL on IBM hardware), < 10% overhead

vSphere 5.1 per-VM scalability: 64 vCPUs; 1 TB RAM per VM (2 TB per host); 36 Gb/s network throughput; 1,000,000 IOPS per VM.

Page 5: Best Practices for Virtualizing Apache Hadoop

Why Virtualize Hadoop?

• Virtual servers offer advantages over physical servers:
  – Standardization on a single common software stack
  – Higher consistency and reliability due to abstracting the hardware environment
  – Operational flexibility with vMotion, Storage vMotion, live cloning, template deployments, hot memory and CPU add, Distributed Resource Scheduling, private VLANs, Storage and Network I/O Control, etc.
• Virtualization is a natural step toward the cloud:
  – Enables Hadoop as a service in a public or private cloud
  – Cloud providers are making it easy to deploy Hadoop for POCs, dev and test environments
  – Cloud and virtualization vendors are offering elastic MapReduce solutions

Page 6: Best Practices for Virtualizing Apache Hadoop

Virtualization Features

– Faster provisioning
– Live cloning
– Live migrations
– Templates
– Live storage migrations
– Distributed Resource Scheduling
– High Availability
– Hot CPU and memory add
– VM replication
– Network isolation using VXLANs
– Multi-VM trust zones
– VM backups
– Distributed Power Management
– Elasticity
– Multi-tenancy
– Storage/Network I/O Control
– Private virtual networks
– 16Gb FC support
– iSCSI jumbo frame support

Note: features/functionality depend on the hypervisor.

Page 7: Best Practices for Virtualizing Apache Hadoop

Hortonworks Data Platform

Building an Enterprise Virtual Platform

[Stack diagram: Hardware → Hypervisor → Linux / Windows → Hadoop]

• Core Hadoop (kernel): Distributed Storage (HDFS), Distributed Processing (MapReduce)
• Hadoop essentials: Hive (query), Pig (scripting), HCatalog (metadata mgmt), Zookeeper (coordination), HBase (column DB), WebHCatalog (REST-like APIs), Mahout (machine learning), Oozie (workflow), WebHDFS (REST API)
• Data extraction and load: Sqoop (DB transfer), FlumeNG (data transfer), "others" (Talend, Informatica, etc.)
• Management and monitoring: Ambari (management), Ganglia (monitoring), Nagios (alerts)

Page 8: Best Practices for Virtualizing Apache Hadoop

Virtualizing Hadoop

• The primary goal of virtualizing master and slave servers is the same: maximize operational efficiency and leverage existing hardware.
• However, the strategy for virtualizing Hadoop master servers differs from that for Hadoop slave servers:
  – Hadoop master servers can follow virtualization best practices and guidelines for tier-1 and business-critical environments.
  – Hadoop slave servers need to follow virtualization best practices and also use Hadoop Virtualization Extensions (HVE) so the Hadoop cluster is "virtualization aware".

Page 9: Best Practices for Virtualizing Apache Hadoop

Virtualizing Master Servers

• Virtualize the master servers (NameNode, JobTracker, HBase Master, Secondary NameNode).
  – Consider any key management servers: Ganglia, Nagios, Ambari, Active Directory, metadata databases.
• Goals of a virtual enterprise Hadoop platform:
  – Less downtime (live migrations, cloning, …)
  – A more reliable software stack
  – A higher quality of service
  – Reduced CapEx and OpEx
  – Increased operational flexibility with virtualization features
  – VMware High Availability (enabled with five clicks)
• Shared storage for the Hadoop master servers is required to fully leverage virtualization features.

Page 10: Best Practices for Virtualizing Apache Hadoop

Configure the Environment Properly

• Do not overcommit SLA or production environments.
• Size virtual machines to avoid entering the host's "soft" memory state, which will likely break host large pages into small pages. Leaving at least 6% of memory for the hypervisor and VM memory overhead is a conservative guideline.
  – If free memory drops below minFree (the "soft" memory state), memory is reclaimed through ballooning and other memory management techniques, all of which require breaking host large pages into small pages.
• Leverage hyperthreading: make sure there is hardware and BIOS support. Hyperthreading can improve performance by up to 20%.
• Do not set memory limits on production servers.
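As a rough sketch of the sizing guideline above (the exact reservation depends on the hypervisor version and per-VM overhead, so treat the 6% figure as a conservative starting point):

```shell
# Memory left for VMs after reserving ~6% for the hypervisor and
# per-VM overhead. Integer GB arithmetic; rounds the reservation up.
usable_vm_memory_gb() {
  host_gb=$1
  reserved=$(( (host_gb * 6 + 99) / 100 ))  # ceil(6% of host RAM)
  echo $(( host_gb - reserved ))
}

usable_vm_memory_gb 256   # a 256 GB host leaves ~240 GB for VMs
```

Staying under this ceiling keeps the host out of the "soft" memory state where ballooning and page-sharing start breaking large pages.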

Page 11: Best Practices for Virtualizing Apache Hadoop

Configure the Environment Properly (2)

• Run the latest version of the hypervisor, BIOS and virtual tools.
• Verify BIOS settings enable all populated processor sockets and all cores in each socket.
• Enable "Turbo Boost" in the BIOS if the processors support it.
• Disabling unused hardware devices in the BIOS can free interrupt resources: COM and LPT ports, USB controllers, floppy drives, network interfaces, optical drives, storage controllers, etc.
• Enable virtualization features in the BIOS (VT-x, AMD-V, EPT, RVI).
• Initially leave the memory scrubbing rate at the manufacturer's default setting.

Page 12: Best Practices for Virtualizing Apache Hadoop

More Best Practices

• Configure the OS kernel as a single-core or multi-core kernel based on the number of vCPUs being used.
• Understand how NUMA affects your VMs:
  – Try to keep each VM's size within a NUMA node.
  – Look at disabling node interleaving (leave NUMA enabled).
  – Maintain memory locality.
• Let the hypervisor control power management via the BIOS setting "OS Controlled Mode".
• Enable C1E in the BIOS.
• Have a very good reason for using CPU affinity; otherwise avoid it like the plague.

Page 13: Best Practices for Virtualizing Apache Hadoop

Linux Best Practices

• Kernel and limit parameters:
  – nofile=16384
  – nproc=32000
  – File descriptors set to 65535
  – File system read-ahead buffer increased to 1,024 or 2,048
  – Epoll file descriptor limit increased to 4096
• Mount file systems with the noatime and nodiratime attributes (disables access-time updates).
• Turn off swapping.
• Use ext4 or XFS (mounted noatime):
  – ext4 can be about 5% better than XFS on reads.
  – XFS can be 12–25% better on writes (and defragments automatically in the background).
• Linux 2.6.30+ can give 60% better energy consumption.
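A minimal sketch of staging these settings; the hdfs/mapred account names and the /dev/sdb device path are assumptions for illustration, and the fragments are written to local files for review rather than applied live:

```shell
# Per-user limits (limits.conf syntax) for the Hadoop service accounts.
cat > limits-hadoop.conf <<'EOF'
hdfs    -  nofile  65535
hdfs    -  nproc   32000
mapred  -  nofile  65535
mapred  -  nproc   32000
EOF

# Swapping off, and read-ahead raised to 1024 sectors on the data disk.
echo 'vm.swappiness = 0'              >  sysctl-hadoop.conf
echo 'blockdev --setra 1024 /dev/sdb' >  set-readahead.sh   # run at boot

# Data disk mounted ext4 with access-time updates disabled.
echo '/dev/sdb1 /grid/0 ext4 noatime,nodiratime 0 0' > fstab-hadoop.txt
```

Review the staged files, then merge them into /etc/security/limits.d, /etc/sysctl.conf, an init script and /etc/fstab on each node.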

Page 14: Best Practices for Virtualizing Apache Hadoop

Networking Best Practices

• Separate VM traffic from live migration and management traffic: use separate NICs with separate vSwitches.
• Leverage NIC teaming (at least 2 NICs per vSwitch).
• Leverage the latest adapters and drivers from the hypervisor vendor.
• Be careful with multi-queue networking: Hadoop drives a high packet rate, but not high enough to justify the overhead of multi-queue.
• Network:
  – Channel bonding two GbE ports can give better I/O performance.
  – 8 queues per port.
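The channel-bonding point can be sketched as a pair of interface configs; the eth0/eth1 names and the 802.3ad mode are assumptions to adapt to your switches and distribution:

```shell
# Staged bonding config (Red Hat ifcfg syntax) joining two GbE ports.
cat > ifcfg-bond0 <<'EOF'
DEVICE=bond0
BONDING_OPTS="mode=802.3ad miimon=100"
ONBOOT=yes
BOOTPROTO=none
EOF

# Enslave both physical ports to the bond.
for nic in eth0 eth1; do
  printf 'DEVICE=%s\nMASTER=bond0\nSLAVE=yes\nONBOOT=yes\n' "$nic" > "ifcfg-$nic"
done
```

802.3ad (LACP) requires matching switch configuration; simpler modes like active-backup trade throughput for switch independence.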

Page 15: Best Practices for Virtualizing Apache Hadoop

Networking Best Practices (2)

• Evaluate these network adapter features to leverage hardware offloads:
  – Checksum offload
  – TCP segmentation offload (TSO)
  – Jumbo frames (JF)
  – Large receive offload (LRO)
  – Ability to handle high-memory DMA (that is, 64-bit DMA addresses)
  – Ability to handle multiple scatter-gather elements per Tx frame
• Optimize 10 Gigabit Ethernet network adapters: features like NetQueue can significantly improve their performance in virtualized environments.
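One way to audit the offload features above is to filter `ethtool -k` output; the captured sample below stands in for a live NIC:

```shell
# Pull just the offload lines of interest out of `ethtool -k` output.
offload_report() {  # reads `ethtool -k <nic>` output on stdin
  grep -E 'checksumming|tcp-segmentation-offload|large-receive-offload'
}

# Example with captured output; on a live host run:
#   ethtool -k eth0 | offload_report
printf 'rx-checksumming: on\ntcp-segmentation-offload: on\nrx-vlan-offload: on\n' \
  | offload_report
```

Features reported "off" that the hardware supports can usually be enabled with `ethtool -K`.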

Page 16: Best Practices for Virtualizing Apache Hadoop

Storage Best Practices

• Make good storage decisions, i.e. VMFS (VMDK) or Raw Device Mappings (RDM):
  – VMDK leverages all features of virtualization.
  – RDM leverages features of storage vendors (replication, snapshots, …).
  – Run in Advanced Host Controller Interface (AHCI) mode.
  – Enable Native Command Queuing (NCQ).
• Use multiple vSCSI adapters and evenly distribute target devices.
• Use eagerzeroedthick for VMDK files, or uncheck the Windows "Quick Format" option.
• Make sure storage is block-aligned.
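Block alignment can be sanity-checked with simple modular arithmetic; the 1 MiB boundary below is an assumption, so substitute your array's stripe or block size:

```shell
# A partition is aligned if its starting byte offset is a multiple of
# the storage block/stripe size.
aligned() {  # args: start_offset_bytes alignment_bytes
  [ $(( $1 % $2 )) -eq 0 ] && echo aligned || echo misaligned
}

aligned $(( 2048 * 512 )) 1048576   # sector 2048 * 512 B = 1 MiB -> aligned
aligned $((   63 * 512 )) 1048576   # legacy sector-63 start -> misaligned
```

Misaligned partitions cause each guest I/O to touch two backing blocks, roughly doubling backend work.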

Page 17: Best Practices for Virtualizing Apache Hadoop

Virtualizing Data Servers

• Hadoop Virtualization Extensions (HVE) extend the Hadoop topology awareness mechanism to support racks and node groups for hosts containing VMs.
  – Data locality-related policies are maintained within the virtual layer.
• HVE is merged into branch-1:
  – Available in Apache Hadoop 1.2 and HDP 1.2
  – https://issues.apache.org/jira/browse/HADOOP-8817
• Extensions include:
  – Block placement and removal policies
  – Balancer policies
  – Task scheduling
  – Network topology awareness
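Hadoop resolves node locations through a topology script (the `topology.script.file.name` hook in Hadoop 1.x); with HVE the script returns a /rack/nodegroup path so HDFS can see which VMs share a physical host. The address table below is purely illustrative:

```shell
# Hypothetical HVE topology script: map a DataNode address to its
# /rack/nodegroup location. Hadoop invokes it with node addresses as args.
resolve_location() {
  case "$1" in
    10.0.1.1|10.0.1.2) echo /rack1/nodegroup1 ;;  # VMs on the same host
    10.0.1.3|10.0.1.4) echo /rack1/nodegroup2 ;;
    10.0.2.*)          echo /rack2/nodegroup3 ;;
    *)                 echo /default-rack     ;;
  esac
}

for addr in "$@"; do resolve_location "$addr"; done
```

In practice the mapping comes from a maintained host inventory, not a hard-coded case statement.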

Page 18: Best Practices for Virtualizing Apache Hadoop

HVE: Virtualization Topology Awareness

[Diagram: a data center with two racks (Rack1, Rack2) containing eight hosts (Host1-Host8); each host runs multiple VMs, and the hosts are grouped into four node groups (NodeG1-NodeG4).]

• HVE extends the Hadoop topology awareness mechanism to support racks and node groups for hosts containing VMs, with data locality-related policies maintained within the virtual layer.

Page 19: Best Practices for Virtualizing Apache Hadoop

HVE: Replica Policies

Standard replica policies:
– 1st replica is on the local (closest) node of the writer.
– 2nd replica is on a separate rack from the 1st replica.
– 3rd replica is on the same rack as the 2nd replica.
– Remaining replicas are placed randomly across racks to meet the minimum restriction.

Extension replica policies:
– Multiple replicas are not placed on the same node or on nodes under the same node group.
– 1st replica is on the local node or local node group of the writer.
– 2nd replica is on a remote rack from the 1st replica.

Multiple replicas are not placed on the same node under either the standard or extension placement/removal policies. The same rules are maintained for the balancer.
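The key extension rule can be sketched as a placement check over HVE-style /rack/nodegroup locations (a toy illustration of the rule, not HDFS's actual placement code):

```shell
# Reject a candidate replica location if it shares a node group with an
# existing replica: VMs in one node group sit on the same physical host,
# so a host failure would take out both copies.
placement_check() {  # args: existing_replica_location candidate_location
  [ "$1" = "$2" ] && echo reject || echo accept
}

placement_check /rack1/nodegroup1 /rack1/nodegroup1   # reject: same host
placement_check /rack1/nodegroup1 /rack1/nodegroup2   # accept: different hosts
```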

Page 20: Best Practices for Virtualizing Apache Hadoop

Follow Virtualization Best Practices

§ Hardware: validate virtualization and Hadoop configurations with vendor hardware compatibility lists.
§ Hadoop: follow recommended Hadoop reference architectures.
§ Storage: review storage vendor recommendations.
§ Virtualization: follow virtualization vendors' best practices, deployment guides and workload characterizations.
§ Internal: validate internal guidelines and best practices for configuring and managing corporate VMs.

Page 21: Best Practices for Virtualizing Apache Hadoop

Benefits of Running Hadoop in a Private Cloud

• Elastic Hadoop
  – Create a pool of cluster nodes
  – On-demand cluster scale up/down
• Multi-tenant Hadoop
  – Better isolate workloads and enforce organizational security boundaries
• CapEx reduction
  – Better utilization of physical servers
  – Cluster "timeshare"
  – Promote responsible usage through chargeback/showback
• OpEx reduction
  – Rapid provisioning and self-provisioning
  – Simplified cluster maintenance

Page 22: Best Practices for Virtualizing Apache Hadoop

Hortonworks & Rackspace Partnership

• Goal: enable Hadoop to run efficiently in OpenStack-based public and private cloud environments.
• Where we stand:
  – Rackspace public cloud service available soon (Q3 CY13)
  – Continued work on enabling the Hortonworks Data Platform to run efficiently on the Rackspace OpenStack private cloud platform
• Project Savanna: automate the deployment of Hadoop on enterprise-class OpenStack clouds.

Page 23: Best Practices for Virtualizing Apache Hadoop

Final Thoughts

• Virtualization features can provide operational advantages to a Hadoop cluster.
• Many companies have expertise in virtualizing tier-two/three platforms but not tier one. Be careful of growing pains.
• Can your organization handle moving to Hadoop and managing an enterprise virtual infrastructure at the same time?
• Give Hadoop Virtualization Extensions time to bake.
• Organizations are increasing their percentage of virtual servers and cloud deployments, and they do not want to step back to physical servers unless they have to.

Page 24: Best Practices for Virtualizing Apache Hadoop

Next Steps

Download Hortonworks Sandbox www.hortonworks.com/sandbox

Download Hortonworks Data Platform www.hortonworks.com/download

Register for Hadoop Series www.hortonworks.com/webinars

Page 25: Best Practices for Virtualizing Apache Hadoop

Hadoop Summit

Architecting the Future of Big Data

• June 26–27, 2013, San Jose Convention Center
• Co-hosted by Hortonworks & Yahoo!
• Theme: Enabling the Next Generation Enterprise Data Platform
• 90+ sessions and 7 tracks
• Community-focused event:
  – Sessions selected by a conference committee
  – Community Choice allowed the public to vote for sessions they want to see
• Training classes offered pre-event:
  – Apache Hadoop Essentials: A Technical Understanding for Business Users
  – Understanding Microsoft HDInsight and Apache Hadoop
  – Developing Solutions with Apache Hadoop: HDFS and MapReduce
  – Applying Data Science using Apache Hadoop

hadoopsummit.org

Page 26: Best Practices for Virtualizing Apache Hadoop

Thank You For Attending

Best Practices for Virtualizing Hadoop

George Trujillo Blog: http://cloud-dba-journey.blogspot.com Twitter: GeorgeTrujillo