
Getting Started with Hadoop with Amazon’s Elastic MapReduce

Scott Hendrickson [email protected]

http://drskippy.net/projects/EMR-HadoopMeetup.pdf

Boulder/Denver Hadoop Meetup

8 July 2010

Scott Hendrickson (Hadoop Meetup)   EMR-Hadoop   8 July 2010 1 / 43


Agenda

1   Amazon Web Services

2   Interlude: Solving problems with Map and Reduce

3   Running MapReduce on Amazon Elastic MapReduce
    Example 1: Streaming Work Flow with AWS Management Console
    Example 2 - Word count (Slightly more useful)
    Example 3 - elastic-mapreduce command line tool

4   References and Notes


Amazon Web Services

What is Amazon Web Services?

For a first Hadoop project on AWS, use these services:

Elastic Compute Cloud (EC2)

Amazon Simple Storage Service (S3)

Elastic MapReduce (EMR)

For future projects, AWS is much more:

SimpleDB, Relational Database Services

Simple Queue Service (SQS), Simple Notification Service (SNS)

Alexa

Mechanical Turk

. . .


Signing up for AWS

1 Create an AWS account -   http://aws.amazon.com/

2 Sign up for EC2 cloud compute services - http://aws.amazon.com/ec2/

3 Set up Security Credentials (under menu Account | Security Credentials) - there are 3 kinds of credentials; you need to create an “Access Key” and use it to access S3 storage

4 Sign up for S3 storage services -   http://aws.amazon.com/s3/

5 Sign up for EMR -   http://aws.amazon.com/elasticmapreduce/


What are S3 buckets?

Streaming EMR projects use Simple Storage Service (S3) Buckets for data, code, logging and output.

Bucket  “A bucket is a container for objects stored in Amazon S3. Every object is contained in a bucket.” Bucket names must be globally unique.

Object  “Entities stored in Amazon S3. Objects consist of object data and metadata.” Metadata consists of key-value pairs. Object data is opaque.

Object Keys  “An object is uniquely identified within a bucket by a key (name) and a version ID.”


Accessing objects in S3 buckets

Want to:

1 Move data into and out of S3 buckets

2 Set access privileges

Tools:

S3 console in your AWS control panel is adequate for managing S3 buckets and objects one at a time

Other browser options: good for multiple file upload/download - Firefox S3 https://addons.mozilla.org/en-US/firefox/addon/3247/ ; or minimal - S3 plug-in for Chrome https://chrome.google.com/extensions/detail/appeggcmoaojledegaonmdaakfhjhchf

Programmatic options: Web Services (both SOAP-y and REST-ful): wget, curl, Python, Ruby, Java . . .


S3 Bucket Example 1 - RESTful GET

Example - Image object

Bucket:   bsi-test

Key:   image.jpg

Object: JPEG structured data from image.jpg

RESTful GET access, use URL: http://s3.amazonaws.com/bsi-test/image.jpg

Example - Text file object

Bucket:   bsi-test

Key:   foobar

Object: text

RESTful GET access, use URL: http://s3.amazonaws.com/bsi-test/foobar
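The text-file URL above is simply the endpoint, the bucket and the key joined with slashes. A minimal Python sketch of that pattern follows, assuming the object is publicly readable; the helper name `s3_path_url` is hypothetical and not part of any AWS library.

```python
from urllib.parse import quote

S3_ENDPOINT = "http://s3.amazonaws.com"  # endpoint used on the slide


def s3_path_url(bucket, key):
    """Build a path-style URL for a publicly readable S3 object.

    Hypothetical helper, for illustration only.
    """
    # quote() percent-encodes characters that are unsafe in a URL path
    # but leaves "/" alone, so keys such as "dir/file.txt" stay readable.
    return "%s/%s/%s" % (S3_ENDPOINT, bucket, quote(key))


if __name__ == "__main__":
    print(s3_path_url("bsi-test", "foobar"))
```

Any HTTP client (wget, curl, a browser) can then issue a GET against the resulting URL.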


S3 Bucket Example 2

Example - Python, Boto, Metadata

from boto.s3.connection import S3Connection

# Replace with your own AWS Access Key ID and Secret Access Key.
conn = S3Connection('key-id', 'secret-key')
bucket = conn.get_bucket('bsi-test')

# Image object: read one metadata value, then download the contents.
k = bucket.get_key('image.jpg')
print("Value for key 'x-amz-meta-s3fox-modifiedtime' is:")
print(k.get_metadata('s3fox-modifiedtime'))
k.get_contents_to_filename('deleteme.jpg')

# Text object: print the contents and one metadata value.
k = bucket.get_key('foobar')
print("Object value for key 'foobar' is:")
print(k.get_contents_as_string())
print("Value for key 'x-amz-meta-example-key' is:")
print(k.get_metadata('example-key'))


S3 Bucket Example 2

Example - Python, Boto, Metadata - Output

scott@mowgli-ubuntu:~/Dropbox/hadoop$ ./botoExample.py
Value for key 'x-amz-meta-s3fox-modifiedtime' is:
1273869756000
Object value for key 'foobar' is:
This is a test of S3
Value for key 'x-amz-meta-example-key' is:
This is an example value.


What is Elastic Map Reduce?

Hadoop  Hosted Hadoop framework running on EC2 and S3.

Job Flow  Processing steps EMR “runs on a specified dataset using a set of Amazon EC2 instances.”

S3 Bucket(s)   Input data, output, scripts, jars, logs.


Controlling Job Flows

Want to:

1 Configure jobs

2 Start jobs

3 Check status or stop jobs

Tools:

AWS Management Console https://console.aws.amazon.com/elasticmapreduce/home

Command Line Tools (requires Ruby [sudo apt-get install ruby libopenssl-ruby]) http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2264&categoryID=262

API calls defined by the service (REST-ful and SOAP-y)


EMR Example 1 - Running a simple Work Flow from the AWS Management Console

EMR Example 1

Hold up a minute. . . !

What problem are we solving?


Interlude: Solving problems with Map and Reduce


Central MapReduce Ideas

Operate on key-value pairs

Data scientist provides  map and  reduce

\[
\begin{aligned}
(\text{input})\quad \langle k_1, v_1 \rangle &\xrightarrow{\text{map}} \langle k_2, v_2 \rangle \\
\langle k_2, v_2 \rangle &\xrightarrow{\text{combine, sort}} \langle k_2, v_2 \rangle \\
\langle k_2, v_2 \rangle &\xrightarrow{\text{reduce}} \langle k_3, v_3 \rangle \quad (\text{output})
\end{aligned}
\]

(Optional: Combine provided in map, may significantly reduce bandwidth between workers)

Efficient Sort provided by MapReduce library. Implies efficient compare(k2a, k2b)

“Implicit” parallelization - splitting and distributing data, starting maps, reduces, collecting output
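The map, sort and reduce stages can be imitated in a single process. The sketch below is a toy, not how Hadoop is implemented: it has no parallelism, no combiner and no fault tolerance. The names `run_mapreduce`, `length_map` and `count_reduce` are made up for illustration.

```python
from itertools import groupby
from operator import itemgetter


def run_mapreduce(records, map_fn, reduce_fn):
    """Run map, sort and reduce sequentially in one process."""
    # Map: each input pair <k1, v1> may yield any number of <k2, v2> pairs.
    intermediate = []
    for k1, v1 in records:
        intermediate.extend(map_fn(k1, v1))
    # Sort: bring equal k2 keys next to each other (the framework's job).
    intermediate.sort(key=itemgetter(0))
    # Reduce: each k2 with all of its values yields <k3, v3> pairs.
    output = []
    for k2, pairs in groupby(intermediate, key=itemgetter(0)):
        output.extend(reduce_fn(k2, [v2 for _, v2 in pairs]))
    return output


def length_map(line_number, line):
    # Emit <word length, word> for every word on the line.
    return [(len(word), word) for word in line.split()]


def count_reduce(length, words):
    # Emit <word length, number of words with that length>.
    return [(length, len(words))]


if __name__ == "__main__":
    lines = [(0, "map and reduce"), (1, "sort by key")]
    print(run_mapreduce(lines, length_map, count_reduce))
```

Only `length_map` and `count_reduce` are specific to the problem; everything in `run_mapreduce` stands in for what the framework supplies.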


Key components of MapReduce framework

(wikipedia  http://en.wikipedia.org/wiki/MapReduce)

The frozen part of the MapReduce framework is a large distributed sort. The hot spots, which the application defines, are:

1 input reader

2 Map function

3 partition function

4 compare function

5 Reduce function

6 output writer
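Of these hot spots, the partition function is the easiest to overlook: it decides which reduce partition receives each intermediate key. A common choice, assumed here rather than taken from the slides, is a hash of the key modulo the number of reducers. The function name `partition` is hypothetical.

```python
import zlib


def partition(key, num_reducers):
    """Return the index of the reduce partition responsible for key."""
    # crc32 is stable across processes and machines; Python's built-in
    # hash() is randomized per process for strings, so it would not do.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


if __name__ == "__main__":
    for word in ["Hello", "World", "Bye", "Hello"]:
        print(word, "->", partition(word, 4))
```

Because the function depends only on the key, every occurrence of a key lands in the same partition, no matter which map task produced it.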


Google Tutorial View

1 MapReduce library shards the input files and starts up many copies of the program on a cluster.

2 Master assigns work to workers. There are map and reduce tasks.

3 Workers assigned map tasks read the contents of the input shard, parse key-value pairs and pass the pairs to the map function. Intermediate key-value pairs produced by the map function are buffered in memory.

4 Periodically, buffered pairs are written to disk, partitioned into regions. Locations of buffered pairs on the local disk are passed to the master.

5 When a reduce worker has read all intermediate data, it sorts by the intermediate keys. All occurrences of a key are grouped together.

6 Reduce workers pass a key and the corresponding set of intermediate values to the reduce function.

7 Output of the reduce function is appended to a final output file for each reduce partition.


MapReduce Example 1 - Word Count - Data

(from Apache Hadoop tutorial)

Example: Word Count

file1:

Hello World Bye World

file2:

Hello Hadoop Goodbye Hadoop


MapReduce Example 1 - Word Count - Sort and Combine

Example: Word Count

The output of the first map:

< Bye, 1>

< Hello, 1>

< World, 2>

The output of the second map:

< Goodbye, 1>

< Hadoop, 2>

< Hello, 1>


MapReduce Example 1 - Word Count - Sort and Reduce

Example: Word Count

The Reducer method sums up the values for each key.

The output of the job is:

< Bye, 1>

< Goodbye, 1>

< Hadoop, 2>

< Hello, 2>

< World, 2>
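The whole word count example can be replayed in plain Python. This is a single-process sketch of the map, combine and reduce stages applied to file1 and file2, not Hadoop code; the function names are made up for illustration.

```python
from collections import Counter

FILES = {
    "file1": "Hello World Bye World",
    "file2": "Hello Hadoop Goodbye Hadoop",
}


def map_words(text):
    # Map: emit <word, 1> for every word in the input.
    return [(word, 1) for word in text.split()]


def combine(pairs):
    # Combine: sum the counts per word within one map's output.
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())


def reduce_counts(combined_outputs):
    # Reduce: sum the partial counts for each word across all maps.
    totals = Counter()
    for pairs in combined_outputs:
        for word, count in pairs:
            totals[word] += count
    return sorted(totals.items())


if __name__ == "__main__":
    per_map = [combine(map_words(text)) for text in FILES.values()]
    for pairs in per_map:
        print(pairs)
    print(reduce_counts(per_map))
```

The two per-map lists correspond to the sorted and combined map outputs, and the final list to the output of the job.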


What problems is MapReduce good at solving?

Themes:

Identify, transform, aggregate, filter, count, sort. . .

Requirement of global knowledge of data is (a) “occasional” (vs. cost of map) (b) confined to ordinality

Discovery tasks (vs. high repetition of similar transactional tasks, many reads)

Unstructured data (vs. tabular, indexes!)

Continuously updated data (indexing cost)

Many, many, many machines (fault tolerance)


What problems is MapReduce good at solving?

Memes:

MapReduce  ⇔  SQL (read the comments too) http://www.data-miners.com/blog/2008/01/mapreduce-and-sql-aggregations.html

MapReduce vs. Message Passing Interface (MPI): “MPI is good for task parallelism and Hadoop is good for Data Parallelism.” Finite differences, finite elements, particle-in-cell. . .

MapReduce vs. column-oriented DBs: tabular data, indexes (cantankerous old farts!) http://databasecolumn.vertica.com/database-innovation/mapreduce-a-major-step-backwards/ and http://databasecolumn.vertica.com/database-innovation/mapreduce-ii/

MapReduce vs. relational DBs http://scienceblogs.com/goodmath/2008/01/databases_are_hammers_mapreduc.php


Running MapReduce on Amazon Elastic MapReduce   Example 1: Streaming Work Flow with AWS Management Console


Example 1 - Add up integers

Data

34-14

-311. . .

Map

#!/usr/bin/env python
import sys

# Emit every integer under the same key so one reducer sees them all.
for line in sys.stdin:
    print('%s%s%d' % ("sum", '\t', int(line)))


Example 1 - Add up integers

Reduce

#!/usr/bin/env python
import sys

sum_of_ints = 0
for line in sys.stdin:
    key, value = line.split('\t')  # key is always the same
    try:
        sum_of_ints += int(value)
    except ValueError:
        pass  # ignore values that are not integers
try:
    print("%s%s%d" % (key, '\t', sum_of_ints))
except NameError:  # No items processed
    pass


Example 1 - Add up integers

Shell test

cat ./input/ints.txt | ./mapper.py > ./inter
cat ./input/ints1.txt | ./mapper.py >> ./inter
cat ./input/ints2.txt | ./mapper.py >> ./inter
cat ./input/ints3.txt | ./mapper.py >> ./inter
echo "Intermediate output:"
cat ./inter
cat ./inter | sort | \
    ./reducer.py > ./output/cmdLineOutput.txt


Example 1 - Add up integers

What was that comment earlier about an optional combiner?

Combiner in map

#!/usr/bin/env python
import sys

sum_of_ints = 0
for line in sys.stdin:
    sum_of_ints += int(line)
# Emit one partial sum per map task instead of one pair per input line.
print('%s%s%d' % ("sum", '\t', sum_of_ints))
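Combining in the map is safe here because addition is associative and commutative: summing partial sums gives the same total as summing every value. The sketch below compares both paths on made-up input splits and also counts how many pairs would travel to the reducer. The function names and the data are assumptions for illustration.

```python
def map_then_reduce(splits):
    # Plain mapper: every integer travels to the reducer individually.
    intermediate = [value for split in splits for value in split]
    return sum(intermediate), len(intermediate)


def combine_then_reduce(splits):
    # Combining mapper: each split sends a single partial sum.
    intermediate = [sum(split) for split in splits]
    return sum(intermediate), len(intermediate)


if __name__ == "__main__":
    splits = [[3, 4, -1], [4, -3], [1, 1]]  # made-up input splits
    print(map_then_reduce(splits))      # (total, pairs sent to reducer)
    print(combine_then_reduce(splits))
```

Both paths return the same total, but the combining mapper sends one pair per split rather than one per input line, which is where the bandwidth saving comes from. An operation that is not associative, such as a median, could not be combined this way.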


Example 1 - Add up integers

Combiner shell test

cat ./input/ints.txt | ./mapper_combine.py > ./inter
cat ./input/ints1.txt | ./mapper_combine.py >> ./inter
cat ./input/ints2.txt | ./mapper_combine.py >> ./inter
cat ./input/ints3.txt | ./mapper_combine.py >> ./inter
echo "Intermediate output:"
cat ./inter
cat ./inter | sort | \
    ./reducer.py > ./output/cmdLineCombOutput.txt


Example 1 - Add up integers - AWS Console

1 Upload  oneCount directory with FFS3

2 Create a New Job Flow
Name: "oneCount"
Job Flow: Run own app
Job Type: Streaming

3 Input: bsi-test/oneCount/input
Output: bsi-test/oneCount/outputConsole (must not exist)
Mapper: bsi-test/oneCount/mapper.py
Reducer: bsi-test/oneCount/reducer.py
Extra Args: none


Example 1 - Add up integers - AWS Console

4 Instances: 4
Type: small
Keypair: No (Yes allows ssh to Hadoop master)
Log: yes
Log Location: bsi-test/oneCount/log
Hadoop Debug: no

5 No bootstrap actions

6 Start it, and wait. . .


Running MapReduce on Amazon Elastic MapReduce   Example 2 - Word count (Slightly more useful)


Example 2 - Word count

Map

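#!/usr/bin/env python
import string
import sys

def read_input(file):
    # Split each input line into whitespace-separated words.
    for line in file:
        yield line.split()

def main(separator='\t'):
    data = read_input(sys.stdin)
    for words in data:
        for word in words:
            # Normalize case and strip leading/trailing punctuation.
            lword = word.lower().strip(string.punctuation)
            print('%s%s%d' % (lword, separator, 1))

if __name__ == "__main__":
    main()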


Example 2 - Word count

Reduce

#!/usr/bin/env python
import sys
from itertools import groupby
from operator import itemgetter

def read_mapper_output(file, separator='\t'):
    for line in file:
        yield line.rstrip().split(separator, 1)

def main(separator='\t'):
    data = read_mapper_output(sys.stdin, separator=separator)
    # groupby() merges only adjacent equal keys, so input must be sorted.
    for current_word, group in groupby(data, itemgetter(0)):
        try:
            total_count = sum(int(count)
                              for current_word, count in group)
            print("%s%s%d" % (current_word,
                              separator, total_count))
        except ValueError:
            pass  # skip groups whose counts are not integers

if __name__ == "__main__":
    main()


Example 2 - Word count

Shell test

echo "foo foo quux labs foo bar quux" | ./mapper.py
echo "foo foo quux labs foo bar quux" | ./mapper.py \
    | sort | ./reducer.py
cat ./input/alice.txt | ./mapper.py \
    | sort | ./reducer.py > ./output/cmdLineOutput.txt
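The `sort` between mapper and reducer in the shell test is not optional: `itertools.groupby` only merges adjacent items with equal keys. A small sketch of the difference, using the same test sentence; `count_groups` is a hypothetical stand-in for the reducer's grouping logic.

```python
from itertools import groupby
from operator import itemgetter


def count_groups(pairs):
    # Same grouping logic as the reducer: sum counts per run of equal keys.
    return [(word, sum(count for _, count in group))
            for word, group in groupby(pairs, itemgetter(0))]


if __name__ == "__main__":
    words = "foo foo quux labs foo bar quux".split()
    pairs = [(word, 1) for word in words]
    print(count_groups(pairs))          # unsorted: keys repeat
    print(count_groups(sorted(pairs)))  # sorted: one line per word
```

On a real cluster the framework sorts by key before the reduce step, which is why the reducer script can rely on this behaviour.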


Example 2 - Word count - AWS Console

1 Upload  myWordCount directory with FFS3

2 Create a New Job Flow
Name: "myWordCount"
Job Flow: Run own app
Job Type: Streaming

3 Input: bsi-test/myWordCount/input
Output: bsi-test/myWordCount/outputConsole (must not exist)
Mapper: bsi-test/myWordCount/mapper.py
Reducer: bsi-test/myWordCount/reducer.py
Extra Args: none


Example 2 - Word count - AWS Console

4 Instances: 4
Type: small
Keypair: No (Yes allows ssh to Hadoop master)
Log: yes
Log Location: bsi-test/myWordCount/log
Hadoop Debug: no

5 No bootstrap actions

6 Start it, and wait. . .


Running MapReduce on Amazon Elastic MapReduce   Example 3 - elastic-mapreduce command line tool


Example 3 -  elastic-mapreduce  command line tool

Word count (again, only better)

/usr/local/emr-ruby/elastic-mapreduce --create \
    --stream \
    --num-instances 2 \
    --name from-elastic-mapreduce \
    --input s3n://bsi-test/myWordCount/input \
    --output s3n://bsi-test/myWordCount/outputRubyTool \
    --mapper s3n://bsi-test/myWordCount/mapper.py \
    --reducer s3n://bsi-test/myWordCount/reducer.py \
    --log-uri s3n://bsi-test/myWordCount/log

/usr/local/emr-ruby/elastic-mapreduce --list


References and Notes


MapReduce Concepts Links

Google MapReduce Tutorial: http://code.google.com/edu/parallel/mapreduce-tutorial.html

Apache Hadoop tutorial: http://hadoop.apache.org/common/docs/current/mapred_tutorial.html

Google Code University presentation on MapReduce: http://code.google.com/edu/submissions/mapreduce/listing.html

MapReduce framework paper: http://labs.google.com/papers/mapreduce-osdi04.pdf


Amazon Web Services Links

EMR Getting Started documentation: http://aws.amazon.com/documentation/elasticmapreduce/

Getting started with Amazon S3: http://docs.amazonwebservices.com/AmazonS3/2006-03-01/gsg/

PIG on EMR: http://s3.amazonaws.com/awsVideos/AmazonElasticMapReduce/ElasticMapReduce-PigTutorial.html

Boto Python library (multiple Amazon Services): http://code.google.com/p/boto/


Machine Learning

Linear speedup (with processor number) for “locally weighted linear regression (LWLR), k-means, logistic regression (LR), naive Bayes (NB), SVM, ICA, PCA, gaussian discriminant analysis (GDA), EM, and backpropagation (NN)”: http://www.cs.stanford.edu/people/ang/papers/nips06-mapreducemulticore.pdf

Mahout framework:   http://mahout.apache.org/


Examples Links

Wordcount example/tutorial: http://www.michael-noll.com/wiki/Writing_An_Hadoop_MapReduce_Program_In_Python

CouchDB and MapReduce (interesting examples of MR implementations for common problems): http://wiki.apache.org/couchdb/View_Snippets

This presentation: http://drskippy.net/projects/EMR-HadoopMeetup.pdf or presentation source, example files etc.: http://drskippy.net/projects/EMR-HadoopMeetup.zip
