
    Getting Started with AWS

    Analyzing Big Data


Getting Started with AWS: Analyzing Big Data

Copyright 2015 Amazon Web Services, Inc. and/or its affiliates. All rights reserved.

Amazon's trademarks and trade dress may not be used in connection with any product or service that is not Amazon's, in any manner that is likely to cause confusion among customers, or in any manner that disparages or discredits Amazon. All other trademarks not owned by Amazon are the property of their respective owners, who may or may not be affiliated with, connected to, or sponsored by Amazon.


Table of Contents

Analyzing Big Data (p. 1)
    Key AWS Services for Big Data (p. 1)
Setting Up (p. 3)
    Sign Up for AWS (p. 3)
    Create a Key Pair (p. 3)
Tutorial: Sentiment Analysis (p. 5)
    Step 1: Create a Twitter Developer Account (p. 6)
    Step 2: Create an Amazon S3 Bucket (p. 6)
    Step 3: Collect and Store the Sentiment Data (p. 7)
        Launch an Instance Using AWS CloudFormation (p. 7)
        Collect Tweets Using the Instance (p. 8)
        Store the Tweets in Amazon S3 (p. 10)
    Step 4: Customize the Mapper (p. 10)
    Step 5: Create an Amazon EMR Cluster (p. 11)
    Step 6: Examine the Sentiment Analysis Output (p. 14)
    Step 7: Clean Up (p. 15)
Tutorial: Web Server Log Analysis (p. 16)
    Step 1: Create an Amazon EMR Cluster (p. 16)
    Step 2: Connect to the Master Node (p. 18)
    Step 3: Start and Configure Hive (p. 19)
    Step 4: Create the Hive Table and Load Data into HDFS (p. 19)
    Step 5: Query Hive (p. 20)
    Step 6: Clean Up (p. 21)
More Big Data Options (p. 23)
Related Resources (p. 25)


Analyzing Big Data with Amazon Web Services

    The following tutorials show you ways to use Amazon Web Services to process big data:

Tutorial: Sentiment Analysis (p. 5): How to use Hadoop to evaluate Twitter data

Tutorial: Web Server Log Analysis (p. 16): How to query Apache web server logs using Hive

Key AWS Services for Big Data

With AWS, you pay only for the resources you use. Instead of maintaining a cluster of physical servers and storage devices that are standing by for possible use, you can create resources when you need them. AWS also supports popular tools like Hadoop, Hive, and Pig, and makes it easy to provision, configure, and monitor clusters for running those tools.

    The following table shows how AWS can help address common big-data challenges.

Challenge: Data sets can be very large. Storage can become expensive, and data corruption and loss can have far-reaching implications.
Solution: Amazon S3 can store large amounts of data, and its capacity can grow to meet your needs. It is highly redundant and secure, protecting against data loss and unauthorized use. Amazon S3 also has an intentionally small feature set to keep its costs low.

Challenge: Maintaining a cluster of physical servers to process data is expensive and time-consuming.
Solution: When you run an application on a virtual Amazon EC2 server, you pay for the server only while the application is running, and you can increase the number of servers within minutes, not hours or days, to meet the processing needs of your application.

Challenge: Hadoop and other open-source big-data tools can be challenging to configure, monitor, and operate.
Solution: Amazon EMR handles cluster configuration, monitoring, and management. Amazon EMR also integrates open-source tools with other AWS services to simplify large-scale data processing, so you can focus on data analysis and extracting value.

For more information, see Big Data (http://aws.amazon.com/big-data/).


    Setting Up

Before you use AWS for the first time, complete the following tasks. (These steps prepare you for both of the tutorials in this guide.)

    Tasks

    Sign Up for AWS (p. 3)

Create a Key Pair (p. 3)

Sign Up for AWS

When you sign up for Amazon Web Services (AWS), your AWS account is automatically signed up for all services in AWS and you can start using them immediately. You are charged only for the services that you use.

If you created your AWS account less than 12 months ago, you can get started with AWS for free. For more information, see AWS Free Tier (http://aws.amazon.com/free/).

If you have an AWS account already, skip to the next step. If you don't have an AWS account, use the following procedure to create one.

    To create an AWS account

    1. Open http://aws.amazon.com/, and then click Sign Up.

    2. Follow the on-screen instructions.

Part of the sign-up procedure involves receiving a phone call and entering a PIN using the phone keypad.

Create a Key Pair

AWS uses public-key cryptography to secure the login information for your instance. A Linux instance has no password; you use a key pair to log in to your instance securely. You specify the name of the key pair when you launch your instance, then provide the private key when you log in using SSH.

    If you haven't created a key pair already, you can create one using the Amazon EC2 console.


    To create a key pair

    1. Open the Amazon EC2 console.

    2. From the navigation bar, in the region selector, click US West (Oregon).

    3. In the navigation pane, click Key Pairs.

    4. Click Create Key Pair.

5. Enter a name for the new key pair in the Key pair name field of the Create Key Pair dialog box, and then click Create. Choose a name that is easy for you to remember.

6. The private key file is automatically downloaded by your browser. The base file name is the name you specified as the name of your key pair, and the file name extension is .pem. Save the private key file in a safe place.

Important
This is the only chance for you to save the private key file. You'll need to provide the name of your key pair when you launch an instance and the corresponding private key each time you connect to the instance.

7. Prepare the private key file. This process depends on the operating system of the computer that you're using.

If your computer runs Mac OS X or Linux, use the following command to set the permissions of your private key file so that only you can read it.

    $ chmod 400 my-key-pair.pem

If your computer runs Windows, use the following steps to convert your .pem file to a .ppk file for use with PuTTY.

a. Download and install PuTTY from http://www.chiark.greenend.org.uk/~sgtatham/putty/. Be sure to install the entire suite.

b. Start PuTTYgen (for example, from the Start menu, click All Programs > PuTTY > PuTTYgen).

c. Under Type of key to generate, select SSH-2 RSA.

d. Click Load. By default, PuTTYgen displays only files with the extension .ppk. To locate your .pem file, select the option to display files of all types.

e. Select your private key file and then click Open. Click OK to dismiss the confirmation dialog box.

f. Click Save private key. PuTTYgen displays a warning about saving the key without a passphrase. Click Yes.

g. Specify the same name that you used for the key pair (for example, my-key-pair) and then click Save. PuTTY automatically adds the .ppk file extension.
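If you prefer to create the key pair from a script instead of the console, the following is a minimal sketch using the AWS SDK for Python (Boto3). The key pair name my-key-pair and the US West (Oregon) region are illustrative assumptions; adjust them to match your setup.

# Minimal sketch: create a key pair with Boto3 and save the private key locally.
# Assumes Boto3 is installed and AWS credentials are configured. The key pair
# name "my-key-pair" and the us-west-2 region are illustrative choices.
import os
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")
response = ec2.create_key_pair(KeyName="my-key-pair")

pem_path = "my-key-pair.pem"
with open(pem_path, "w") as f:
    f.write(response["KeyMaterial"])  # the private key material is returned only once

os.chmod(pem_path, 0o400)  # same effect as chmod 400 my-key-pair.pem
print("Saved private key to", pem_path)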


    Tutorial: Sentiment Analysis

Sentiment analysis refers to various methods of examining and processing data in order to identify a subjective response, usually a general mood or a group's opinions about a specific topic. For example, sentiment analysis can be used to gauge the overall positivity toward a blog or a document, or to capture constituent attitudes toward a political candidate. Sentiment data is often derived from social media services and similar user-generated content, such as reviews, comments, and discussion groups. The data sets thus tend to grow large enough to be considered "big data."

Amazon EMR integrates open-source data processing frameworks with the full suite of services from AWS. The resulting architecture is scalable, efficient, and ideal for analyzing large-scale sentiment data, such as tweets over a given time period.

Suppose your company recently released a new product and you want to assess its reception among consumers who use Twitter. In this tutorial, you'll launch an AWS CloudFormation stack that provides a script for collecting tweets. You'll store the tweets in Amazon S3 and customize a mapper file for use with Amazon EMR. Then you'll create an Amazon EMR cluster that uses a Python natural language toolkit, implemented with a Hadoop streaming job, to classify the data. Finally, you'll examine the output files and evaluate the aggregate sentiment of the tweets.

This tutorial typically takes less than an hour to complete. You pay only for the resources you use. The tutorial includes a cleanup step to help ensure that you don't incur additional costs.

    Prerequisites

Before you begin, make sure that you've completed the steps in Setting Up (p. 3).

    Tasks

    Step 1: Create a Twitter Developer Account (p. 6)

    Step 2: Create an Amazon S3 Bucket (p. 6)

    Step 3: Collect and Store the Sentiment Data (p. 7)

Step 4: Customize the Mapper (p. 10)

    Step 5: Create an Amazon EMR Cluster (p. 11)

    Step 6: Examine the Sentiment Analysis Output (p. 14)

    Step 7: Clean Up (p. 15)


Step 1: Create a Twitter Developer Account

In order to collect tweets for analysis, you'll need to create an account on the Twitter developer site and generate credentials for use with the Twitter API.

    To create a Twitter developer account

1. Go to https://dev.twitter.com/user/login and log in with your Twitter user name and password. If you do not yet have a Twitter account, click the Sign up link that appears under the Username field.

2. If you have not yet used the Twitter developer site, you'll be prompted to authorize the site to use your account. Click Authorize app to continue.

3. Go to the Twitter applications page at https://dev.twitter.com/apps and click Create a new application.

4. Follow the on-screen instructions. For the application Name, Description, and Website, you can specify any text; you're simply generating credentials to use with this tutorial, rather than creating a real application.

5. Twitter displays the details page for your new application. Click the Keys and Access Tokens tab to collect your Twitter developer credentials. You'll see a Consumer key and Consumer secret. Make a note of these values; you'll need them later in this tutorial. You may want to store your credentials in a text file.

6. At the bottom of the page, click Create my access token. Make a note of the Access token and Access token secret values that appear, or add them to the text file you created in the preceding step.

If you need to retrieve your Twitter developer credentials at any point, go to https://dev.twitter.com/apps and select the application that you created for the purposes of this tutorial.

Step 2: Create an Amazon S3 Bucket

Amazon EMR jobs typically use Amazon S3 buckets for input and output data files, as well as for any mapper and reducer files that aren't provided by open-source tools. For the purposes of this tutorial, you'll create an Amazon S3 bucket in which you'll store the data files and a custom mapper.

    To create an Amazon S3 bucket using the console

    1. Open the Amazon S3 console.

    2. Click Create Bucket.

3. In the Create Bucket dialog box, do the following:

a. Specify a name for your bucket, such as mysentimentjob. To meet Hadoop requirements, your bucket name is restricted to lowercase letters, numbers, periods (.), and hyphens (-).

    b. For the region, select US Standard.

    c. Click Create.

4. Select your new bucket from the All Buckets list and click Create Folder. In the text box, specify input as the folder name, and then press Enter or click the check mark.

5. Repeat the previous step to create a folder named mapper at the same level as the input folder.

6. For the purposes of this tutorial (to ensure that all services can use the folders), make the folders public as follows:

a. Select the input and mapper folders.


b. Click Actions and then click Make Public.

    c. When prompted for confirmation, click OK.
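If you'd rather script this step, here is a minimal Boto3 sketch that creates the bucket and the two folders. The bucket name mysentimentjob is the example name from step 3a and must be replaced with a globally unique name of your own.

# Minimal sketch: create the tutorial bucket and its input/ and mapper/ folders
# with Boto3. Assumes AWS credentials are configured; "mysentimentjob" is the
# example bucket name and must be replaced with a globally unique name.
import boto3

s3 = boto3.client("s3", region_name="us-east-1")  # US Standard in the console

bucket = "mysentimentjob"
s3.create_bucket(Bucket=bucket)

# S3 has no real folders; zero-byte keys ending in "/" act as folder markers.
for folder in ("input/", "mapper/"):
    s3.put_object(Bucket=bucket, Key=folder)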

Step 3: Collect and Store the Sentiment Data

In this step, you'll use an AWS CloudFormation template to launch an instance, and then use a command-line tool on the instance to collect the Twitter data. You'll also use a command-line tool on the instance to store the data in the Amazon S3 bucket that you created.

    Tasks

    Launch an Instance Using AWS CloudFormation (p. 7)

    Collect Tweets Using the Instance (p. 8)

    Store the Tweets in Amazon S3 (p. 10)

Launch an Instance Using AWS CloudFormation

We have provided a simple template that launches a single instance using an Amazon Linux AMI, the key pair that you created, and a security group that grants everyone (0.0.0.0/0) access using SSH.

    To launch the AWS CloudFormation stack

    1. Open the AWS CloudFormation console.

2. Make sure that US East (N. Virginia) is selected in the region selector of the navigation bar.

    3. Click Create Stack.

4. On the Select Template page, do the following:

a. Under Stack, in the Name box, specify a name that is easy for you to remember. For example, my-sentiment-stack.

b. Under Template, select Specify an Amazon S3 template URL, and specify https://s3.amazonaws.com/awsdocs-tutorials/sentiment-analysis/sentimentGSG.template in the text box.

    c. Click Next.


5. On the Specify Parameters page, do the following:

a. In the KeyPair box, specify the name of the key pair that you created in Create a Key Pair (p. 3). Note that this key pair must be in the US East (N. Virginia) region.

b. In the TwitterConsumerKey, TwitterConsumerSecret, TwitterToken, and TwitterTokenSecret boxes, specify your Twitter credentials. For best results, copy and paste the Twitter credentials from the Twitter developer site or the text file you saved them in.

Note
The order of the Twitter credential boxes on the Specify Parameters page may not match the display order on the Twitter developer site. Verify that you're pasting the correct value in each box.

c. Click Next.

6. On the Options page, click Next.

7. On the Review page, select the I acknowledge that this template might cause AWS CloudFormation to create IAM resources check box, and then click Create to launch the stack.

8. Your new AWS CloudFormation stack appears in the list with its status set to CREATE_IN_PROGRESS.

Note
Stacks take several minutes to launch. To see whether this process is complete, click Refresh. When your stack status is CREATE_COMPLETE, it's ready to use.
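As an alternative to the console, the stack can also be launched with a short Boto3 sketch like the one below. The stack name is an example, the parameter keys are assumed to follow the box names shown above, and the credential values are placeholders that you must fill in.

# Minimal sketch: launch the tutorial stack with Boto3 instead of the console.
# The stack name is an example; replace the "..." values with your Twitter credentials.
import boto3

cf = boto3.client("cloudformation", region_name="us-east-1")

cf.create_stack(
    StackName="my-sentiment-stack",
    TemplateURL="https://s3.amazonaws.com/awsdocs-tutorials/sentiment-analysis/sentimentGSG.template",
    Capabilities=["CAPABILITY_IAM"],  # the template might create IAM resources
    Parameters=[
        {"ParameterKey": "KeyPair", "ParameterValue": "my-key-pair"},
        {"ParameterKey": "TwitterConsumerKey", "ParameterValue": "..."},
        {"ParameterKey": "TwitterConsumerSecret", "ParameterValue": "..."},
        {"ParameterKey": "TwitterToken", "ParameterValue": "..."},
        {"ParameterKey": "TwitterTokenSecret", "ParameterValue": "..."},
    ],
)

# Block until the stack reaches CREATE_COMPLETE.
cf.get_waiter("stack_create_complete").wait(StackName="my-sentiment-stack")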

Collect Tweets Using the Instance

The instance has been preconfigured with Tweepy, an open-source package for use with the Twitter API. Python scripts for running Tweepy appear in the sentiment directory.

Note
You may encounter an issue installing Tweepy. If you do encounter this issue, edit setup.py as follows. Remove the following:


install_reqs = parse_requirements('requirements.txt', session=uuid.uuid1())
reqs = [str(req.req) for req in install_reqs]

    Add the following:

import os
from setuptools import setup

with open('requirements.txt') as f:
    reqs = f.read().splitlines()

    To collect tweets using your AWS CloudFormation stack

1. Select the Outputs tab. Copy the DNS name of the Amazon EC2 instance that AWS CloudFormation created from the EC2DNS key.

2. Connect to the instance using SSH. Specify the name of your key pair and the user name ec2-user. For more information, see Connect to Your Linux Instance.

    3. In the terminal window, run the following command:

    $ cd sentiment

4. To collect tweets, run the following command, where term1 is your search term. Note that the collector script is not case sensitive. To use a multi-word term, enclose it in quotation marks.

    $ python collector.py term1

    Examples:

    $ python collector.py kindle

    $ python collector.py "kindle fire"

5. Press Enter to run the collector script. Your terminal window displays the following message:

    Collecting tweets. Please wait.

Note
If your SSH connection is interrupted while the script is still running, reconnect to the instance and run the script with nohup (for example, nohup python collector.py > /dev/null &).

The script collects 500 tweets, which could take several minutes. If you're searching for a subject that is not currently popular on Twitter (or if you edited the script to collect more than 500 tweets), the script will take longer to run. You can interrupt it at any time by pressing Ctrl+C.

    When the script has finished running, your terminal window displays the following message:

    Finished collecting tweets.


    Store the Tweets in Amazon S3

Your sentiment analysis stack has been preconfigured with s3cmd, a command-line tool for Amazon S3. You'll use s3cmd to store your tweets in the bucket that you created in Step 2: Create an Amazon S3 Bucket (p. 6).

    To store the collected tweets in your bucket

1. In your SSH window, run the following command. (The current directory should still be sentiment. If it's not, use cd to navigate to the sentiment directory.)

    $ ls

You should see a file named tweets.date-time.txt, where date and time reflect when the script was run. This file contains the ID numbers and full text of the tweets that matched your search terms.

2. To copy the Twitter data to Amazon S3, run the following command, where tweet-file is the file you identified in the previous step and your-bucket is the name of your bucket.

Important
Be sure to include the trailing slash, to indicate that input is a folder. Otherwise, Amazon S3 will create an object called input in your base S3 bucket.

$ s3cmd put tweet-file s3://your-bucket/input/

    For example:

    $ s3cmd put tweets.Nov12-1227.txt s3://mysentimentjob/input/

    3. To verify that the file was uploaded to Amazon S3, run the following command:

    $ s3cmd ls s3://your-bucket/input/

    You can also use the Amazon S3 console to view the contents of your bucket and folders.
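If s3cmd isn't available, a Boto3 equivalent of the upload and the verification looks like the following sketch; the file and bucket names are placeholders taken from the earlier examples.

# Minimal sketch: upload the collected tweets file to the bucket's input/ folder
# with Boto3 instead of s3cmd, then list the folder to verify the upload.
import boto3

s3 = boto3.client("s3")
tweet_file = "tweets.Nov12-1227.txt"  # the file produced by collector.py
bucket = "mysentimentjob"             # your bucket name

s3.upload_file(tweet_file, bucket, "input/" + tweet_file)

for obj in s3.list_objects_v2(Bucket=bucket, Prefix="input/").get("Contents", []):
    print(obj["Key"], obj["Size"])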

Step 4: Customize the Mapper

When you create your own Hadoop streaming programs, you'll need to write mapper and reducer executables as described in Process Data with a Streaming Cluster in the Amazon Elastic MapReduce Developer Guide. For this tutorial, we've provided a mapper script that you can customize for use with your Twitter search term.

    To customize the mapper

1. On your computer, open a text editor. Copy and paste the following script into a new file, and save the file as sentiment.py.

#!/usr/bin/python

import cPickle as pickle
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.tokenize import word_tokenize
import sys

sys.stderr.write("Started mapper.\n");

def word_feats(words):
    return dict([(word, True) for word in words])

def subj(subjLine):
    subjgen = subjLine.lower()
    # Replace term1 with your subject term
    subj1 = "term1"
    if subjgen.find(subj1) != -1:
        subject = subj1
        return subject
    else:
        subject = "No match"
        return subject

def main(argv):
    classifier = pickle.load(open("classifier.p", "rb"))
    for line in sys.stdin:
        tolk_posset = word_tokenize(line.rstrip())
        d = word_feats(tolk_posset)
        subjectFull = subj(line)
        if subjectFull == "No match":
            print "LongValueSum:" + " " + subjectFull + ": " + "\t" + "1"
        else:
            print "LongValueSum:" + " " + subjectFull + ": " + classifier.classify(d) + "\t" + "1"

if __name__ == "__main__":
    main(sys.argv)

2. Replace term1 with the search term that you used in Step 3: Collect and Store the Sentiment Data (p. 7) and save the file.

Important
Do not change the spacing in the file. Incorrect indentation causes the Hadoop streaming program to fail.

3. Open the Amazon S3 console and locate the mapper folder that you created in Step 2: Create an Amazon S3 Bucket (p. 6).

4. Click Upload and follow the on-screen instructions to upload your customized sentiment.py file.

5. Select the file, click Actions, and then click Make Public.
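Steps 3 through 5 can also be scripted; the following Boto3 sketch uploads the mapper and makes it public in one call. The bucket name is a placeholder.

# Minimal sketch: upload the customized mapper to the mapper/ folder and make it
# public with Boto3, mirroring steps 3-5 above. The bucket name is a placeholder.
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    "sentiment.py",
    "mysentimentjob",                  # your bucket name
    "mapper/sentiment.py",
    ExtraArgs={"ACL": "public-read"},  # same effect as Make Public
)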

Step 5: Create an Amazon EMR Cluster

You can use Amazon EMR to configure a cluster with software, bootstrap actions, and work steps. For this tutorial, you'll run a Hadoop streaming program.


When you configure a cluster with a Hadoop streaming program in Amazon EMR, you specify a mapper and a reducer, as well as any supporting files. The following list provides a summary of the files you'll use for this tutorial.

    For the mapper, you'll use the file you customized in the preceding step.

For the reducer method, you'll use the predefined Hadoop package aggregate. For more information about the aggregate package, see the Hadoop documentation.

Sentiment analysis usually involves some form of natural language processing. For this tutorial, you'll use the Natural Language Toolkit (NLTK), a popular Python platform. You'll use an Amazon EMR bootstrap action to install the NLTK Python module. Bootstrap actions load custom software onto the instances that Amazon EMR provisions and configures. For more information, see Create Bootstrap Actions in the Amazon Elastic MapReduce Developer Guide.

Along with the NLTK module, you'll use a natural language classifier file that we've provided in an Amazon S3 bucket.

For the job's input data and output files, you'll use the Amazon S3 bucket you created (which now contains the tweets you collected).

Note that the files used in this tutorial are for illustration purposes only. When you perform your own sentiment analysis, you'll need to write your own mapper and build a sentiment model that meets your needs. For more information about building a sentiment model, see Learning to Classify Text in Natural Language Processing with Python, which is provided for free on the NLTK site.

    To create a cluster

    1. Open the Amazon EMR console.

    2. Choose Create cluster.

    3. Choose Go to advanced options.

4. Specify a name for the cluster, or leave the default value of My cluster. Set Termination protection to No and clear the Logging Enabled check box.

Note
In a production environment, logging and debugging can be useful tools for analyzing errors or inefficiencies in Amazon EMR steps or programs. For more information, see Troubleshooting in the Amazon Elastic MapReduce Developer Guide.

5. Under Software Configuration, leave the default Hadoop distribution setting, Amazon, and select the latest 3.x AMI version. Under Applications to be installed, click each X to remove Hive, Pig, and Hue from the list.

6. Under File System Configuration and Hardware Configuration, leave the default settings.

7. Under Security and Access, select your key pair from EC2 key pair. Leave the default IAM user access. If this is your first time using Amazon EMR, click Create Default Role to create the default EMR role and then click Create Default Role to create the default EC2 instance profile.


    8. Under Bootstrap Actions, do the following:

a. In the Add bootstrap action list, select Custom action. You'll add a custom action that installs and configures the Natural Language Toolkit on the cluster.

b. Click Configure and add.

c. In the Add Bootstrap Action dialog box, in the Amazon S3 location box, copy and paste the following path:

s3://awsdocs-tutorials/sentiment-analysis/config-nltk.sh

Alternatively, copy and paste the following script into a new file, upload that file to an Amazon S3 bucket, and use that location instead:

#!/bin/bash
sudo yum -y install git gcc python-dev python-devel
sudo pip install -U numpy
sudo pip install pyyaml nltk
sudo pip install -e git://github.com/mdp-toolkit/mdp-toolkit#egg=MDP
sudo python -m nltk.downloader -d /usr/share/nltk_data all

d. Click Add. The console then displays the configured bootstrap action.

    9. Under Steps, do the following:

a. Set Auto-terminate to Yes.

b. In the Add step list, select Streaming program, and then click Configure and add.


c. In the Add Step dialog box, configure the job as follows, replacing your-bucket with the name of the Amazon S3 bucket you created earlier, and then click Add:

Set Name to Sentiment analysis

Set Mapper to s3://your-bucket/mapper/sentiment.py

Set Reducer to aggregate

Set Input S3 location to s3://your-bucket/input/

Set Output S3 location to s3://your-bucket/output/ (note that this folder must not exist)

Set Arguments to -cacheFile s3://awsdocs-tutorials/sentiment-analysis/classifier.p#classifier.p

Set Action on failure to Continue

    10. At the bottom of the page, click Create cluster.

A summary of your new cluster appears, with the status Starting. It takes a few minutes for Amazon EMR to provision the EC2 instances for your cluster.
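The same cluster can also be launched programmatically. The following Boto3 sketch mirrors the console settings above; the AMI version, instance types, and the Hadoop streaming jar path are assumptions for illustration and may need adjusting for your account and region.

# Minimal sketch: launch a comparable streaming cluster with Boto3. The AMI
# version, instance types, and streaming jar path are illustrative assumptions.
import boto3

emr = boto3.client("emr", region_name="us-east-1")
bucket = "mysentimentjob"  # the bucket created earlier

response = emr.run_job_flow(
    Name="My cluster",
    AmiVersion="3.11.0",
    Instances={
        "MasterInstanceType": "m1.large",
        "SlaveInstanceType": "m1.large",
        "InstanceCount": 3,
        "Ec2KeyName": "my-key-pair",
        "KeepJobFlowAliveWhenNoSteps": False,  # auto-terminate after the step
    },
    BootstrapActions=[{
        "Name": "Install NLTK",
        "ScriptBootstrapAction": {
            "Path": "s3://awsdocs-tutorials/sentiment-analysis/config-nltk.sh"
        },
    }],
    Steps=[{
        "Name": "Sentiment analysis",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "/home/hadoop/contrib/streaming/hadoop-streaming.jar",
            "Args": [
                "-mapper", "s3://%s/mapper/sentiment.py" % bucket,
                "-reducer", "aggregate",
                "-input", "s3://%s/input/" % bucket,
                "-output", "s3://%s/output/" % bucket,
                "-cacheFile",
                "s3://awsdocs-tutorials/sentiment-analysis/classifier.p#classifier.p",
            ],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    VisibleToAllUsers=True,
)
print("Cluster ID:", response["JobFlowId"])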

Step 6: Examine the Sentiment Analysis Output

When your cluster's status in the Amazon EMR console is Waiting: Waiting after step completed, you can examine the results.

    To examine the streaming program results

1. Open the Amazon S3 console and locate the bucket you created in Step 2: Create an Amazon S3 Bucket (p. 6). You should see a new output folder in your bucket. You might need to click the refresh arrow in the top right corner to see the new folder.

2. The job output is split into several files: an empty status file named _SUCCESS and several part-xxxxx files. The part-xxxxx files contain sentiment measurements generated by the Hadoop streaming program.

3. Select an output file, click Actions, and then click Download. Right-click the link in the pop-up window and click Save Link As to download the file.

Repeat this step for each output file.

4. Open the files in a text editor. You'll see the total number of positive and negative tweets for your search term, as well as the total number of tweets that did not match any of the positive or negative


terms in the classifier (usually because the subject term was in a different field, rather than in the actual text of the tweet).

    For example:

kindle: negative 13
kindle: positive 479
No match: 8

In this example, the sentiment is overwhelmingly positive. In most cases, the positive and negative totals are closer together. For your own sentiment analysis work, you'll want to collect and compare data over several time periods, possibly using several different search terms, to get as accurate a measurement as possible.
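If you downloaded several part-xxxxx files, a small script can total the counts for you. The sketch below assumes the files sit in the current directory and that each line ends with a tab-separated count, as in the example output above.

# Minimal sketch: total up the sentiment counts from downloaded part-xxxxx files.
# Assumes the files are in the current directory and each line ends with a
# tab-separated count, matching the example output shown above.
import glob
from collections import Counter

totals = Counter()
for path in glob.glob("part-*"):
    with open(path) as f:
        for line in f:
            label, sep, count = line.rstrip("\n").rpartition("\t")
            if sep and count.isdigit():
                totals[label.strip()] += int(count)

for label, count in totals.most_common():
    print(label, count)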

Step 7: Clean Up

To prevent your account from accruing additional charges, you should clean up the AWS resources that you created for this tutorial.

    To delete the AWS CloudFormation stack

    1. Open the AWS CloudFormation console.

    2. Select your sentiment stack and click Delete Stack.

    3. When prompted for confirmation, click Yes, Delete.

Because you ran a Hadoop streaming program and set it to auto-terminate after running the steps in the program, the cluster should have been automatically terminated when processing was complete.

    To verify that the cluster was terminated

    1. Open the Amazon EMR console.

2. Click Cluster List.

3. In the cluster list, the Status of your cluster should be Terminated. Otherwise, select the cluster and click Terminate.
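If you prefer to script the cleanup, the following Boto3 sketch deletes the stack and lists any clusters that are still active; the stack name is the example used earlier in this tutorial.

# Minimal sketch: delete the CloudFormation stack and confirm that no EMR
# clusters are still running, using Boto3. The stack name is the earlier example.
import boto3

cf = boto3.client("cloudformation", region_name="us-east-1")
cf.delete_stack(StackName="my-sentiment-stack")

emr = boto3.client("emr", region_name="us-east-1")
for summary in emr.list_clusters(ClusterStates=["STARTING", "RUNNING", "WAITING"])["Clusters"]:
    print("Still active:", summary["Id"], summary["Name"])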


    Tutorial: Web Server Log Analysis

Suppose you host a popular e-commerce website and you want to analyze your Apache web logs to see how people find your site. You want to determine which of your online ad campaigns drive the most traffic to your online store.

The web server logs, however, are too large to import into a MySQL database, and they are not in a relational format. You need another way to analyze them.

    Amazon EMR integrates open-source applications such as Hadoop and Hive with Amazon Web Servicesto provide a scalable and efficient architecture for analyzing large-scale data, such as Apache web logs.

In this tutorial, we'll import data from Amazon S3 and create an Amazon EMR cluster. Then we'll connect to the master node of the cluster, where we'll run Hive to query the Apache logs using a simplified SQL syntax. To learn more about Hive, see http://hive.apache.org/.

This tutorial typically takes less than an hour to complete. You pay only for the resources you use, and the tutorial includes a clean-up step to help ensure that you don't incur unnecessary costs.

    Prerequisites

Before you begin, make sure you've completed the steps in Setting Up (p. 3).

    Tasks

    Step 1: Create an Amazon EMR Cluster (p. 16)

    Step 2: Connect to the Master Node (p. 18)

    Step 3: Start and Configure Hive (p. 19)

    Step 4: Create the Hive Table and Load Data into HDFS (p. 19)

    Step 5: Query Hive (p. 20)

    Step 6: Clean Up (p. 21)

Step 1: Create an Amazon EMR Cluster

You can use Amazon EMR to configure a cluster with software, bootstrap actions, and work steps. For this tutorial, you'll create a cluster with Hive installed and query the data interactively.


    To create a cluster

    1. Open the Amazon EMR console.

    2. Choose Create cluster.

    3. Choose Go to advanced options.

4. Specify a Cluster name or leave the default value of My cluster. Set Termination protection to No and clear the Logging Enabled check box.

Note
In a production environment, logging and debugging can be useful tools for analyzing errors or inefficiencies in Amazon EMR steps or programs. For more information, see Troubleshooting in the Amazon Elastic MapReduce Developer Guide.

5. Under Software Configuration, leave the default Hadoop distribution setting, Amazon, and the latest AMI version. Under Applications to be installed, leave the default Hive settings. Click the X to remove Pig from the list.

    6. Under Hardware Configuration, leave the default settings.

Note
When you analyze data in a real application, you might want to increase the size or number of these nodes to improve processing power and reduce computational time. You may also


want to use spot instances to further reduce your costs. For more information, see Lowering Costs with Spot Instances in the Amazon Elastic MapReduce Developer Guide.

7. Under Security and Access, select your key pair from EC2 key pair. Leave the default IAM user access. If this is your first time using Amazon EMR, click Create Default Role to create the default EMR role and then click Create Default Role to create the default EC2 instance profile.

8. Leave the default Bootstrap Actions and Steps settings. Bootstrap actions and steps allow you to customize and configure your application. For this tutorial, we are using Hive, which is already included in our configuration.

    9. At the bottom of the page, click Create cluster.

When the summary of your new cluster appears, the status is Starting. It takes a few minutes for Amazon EMR to provision the Amazon EC2 instances for your cluster and change the status to Waiting.

Step 2: Connect to the Master Node

When the status of the cluster is Waiting, you can connect to the master node. The way that you connect depends on the operating system of the computer that you're using. For step-by-step directions by operating system, click SSH.

When you've successfully connected to the master node, you'll see a welcome message and prompt similar to the following:

-----------------------------------------------------------------------------

Welcome to Amazon Elastic MapReduce running Hadoop and Amazon Linux.

Hadoop is installed in /home/hadoop. Log files are in /mnt/var/log/hadoop. Check
/mnt/var/log/hadoop/steps for diagnosing step failures.

The Hadoop UI can be accessed via the following commands:

ResourceManager    lynx http://ip-172-16-43-158:9026/
NameNode           lynx http://ip-172-16-43-158:9101/

-----------------------------------------------------------------------------
[hadoop@ip-172-16-43-158 ~]$

Step 3: Start and Configure Hive

Apache Hive is a data warehouse application you can use to query Amazon EMR cluster data with a SQL-like language. Because we selected Hive when we configured the cluster, it's ready to use on the master node.

To use Hive interactively to query the web server log data, you'll need to load some additional libraries. The additional libraries are contained in a Java archive file named hive_contrib.jar on the master node. When you load these libraries, Hive bundles them with the map-reduce job that it launches to process your queries.

    To start and configure Hive on the master node

1. In the terminal window for the master node, at the hadoop prompt, run the following command.

    [hadoop@ip-172-16-43-158 ~]$ hive

Tip
If the hive command is not found, be sure that you specified hadoop as the user name when connecting to the master node, not ec2-user. Otherwise, close this connection and connect to the master node again.

2. At the hive> prompt, run the following command.

    hive> add jar /home/hadoop/hive/lib/hive_contrib.jar;

Tip
If /home/hadoop/hive/lib/hive_contrib.jar is not found, it's possible that there's a problem with the AMI that you selected. Follow the directions in Clean Up (p. 21) and then start this tutorial again, using a different AMI version.

    When the command completes, you'll see a confirmation message similar to the following:

    Added /home/hadoop/hive/lib/hive_contrib.jar to class path

    Added resource: /home/hadoop/hive/lib/hive_contrib.jar

Step 4: Create the Hive Table and Load Data into HDFS

In order for Hive to interact with data, it must translate the data from its current format (in the case of Apache web logs, a text file) into a format that can be represented as a database table. Hive does this


translation using a serializer/deserializer (SerDe). SerDes exist for a variety of data formats. For information about how to write a custom SerDe, see the Apache Hive Developer Guide.

The SerDe we'll use in this example uses regular expressions to parse the log file data. It comes from the Hive open-source community. Using this SerDe, we can define the log files as a table, which we'll query using SQL-like statements later in this tutorial. After Hive has loaded the data, the data persists in HDFS storage as long as the Amazon EMR cluster is running, even if you shut down your Hive session and close the SSH connection.

    To translate the Apache log file data into a Hive table

    1. Copy the following multi-line command.

CREATE TABLE serde_regex(
  host STRING,
  identity STRING,
  user STRING,
  time STRING,
  request STRING,
  status STRING,
  size STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?",
  "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
)
LOCATION 's3://elasticmapreduce/samples/pig-apache/input/';

2. At the hive command prompt, paste the command (use Ctrl+Shift+V in a terminal window or right-click in a PuTTY window), and then press Enter.

    When the command finishes, you'll see a message like this one:

    OK

    Time taken: 12.56 seconds

Note that the LOCATION parameter specifies the location of a set of sample Apache log files in Amazon S3. To analyze your own Apache web server log files, replace the URL in the example command with the location of your own log files in Amazon S3.

Step 5: Query Hive

You can use Hive to query the Apache log file data. Hive translates your query into a Hadoop MapReduce job and runs it on the Amazon EMR cluster. Status messages appear as the Hadoop job runs.

Hive SQL is a subset of SQL; if you know SQL, you'll be able to easily create Hive queries. For more information about the query syntax, see the Hive Language Manual.

    The following are some example queries.


Example 1: Count the number of rows in the Apache web server log files

    select count(1) from serde_regex;

    After the query finishes, you'll see output similar to the following:

    Total MapReduce CPU Time Spent: 13 seconds 860 msec

    OK

    239344

    Time taken: 86.92 seconds, Fetched: 1 row(s)

    Example 2: Return all fields from one row of log file data

    select * from serde_regex limit 1;

    After the query finishes, you'll see output similar to the following:

OK
66.249.67.3 - - [20/Jul/2009:20:12:22 -0700] "GET /gallery/main.php?g2_controller=exif.SwitchDetailMode&g2_mode=detailed&g2_return=%2Fgallery%2Fmain.php%3Fg2_itemId%3D15741&g2_returnName=photo HTTP/1.1" 302 5 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
Time taken: 4.444 seconds, Fetched: 1 row(s)

    Example 3: Count the number of requests from the host with an IP address of 192.168.1.198

    select count(1) from serde_regex where host="192.168.1.198";

    After the query finishes, you'll see output similar to the following:

    Total MapReduce CPU Time Spent: 13 seconds 870 msec

    OK

    46

    Time taken: 73.077 seconds, Fetched: 1 row(s)

Step 6: Clean Up

To prevent your account from accruing additional charges, clean up the following AWS resources that you created for this tutorial.

    To disconnect from the master node

    1. In your terminal or PuTTY window, press CTRL+C to exit Hive.

    2. At the SSH command prompt, run exit.

    3. Close the terminal or PuTTY window.


    To terminate the cluster

    1. Open the Amazon EMR console.

    2. Click Cluster List.

3. Select the cluster name, and then click Terminate. When prompted for confirmation, click Terminate.
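If you'd rather terminate the cluster from a script, a Boto3 sketch like the following can do it; the cluster ID is a placeholder that list_clusters can help you find.

# Minimal sketch: find and terminate the cluster with Boto3. The cluster ID
# printed by list_clusters replaces the placeholder below.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

for summary in emr.list_clusters(ClusterStates=["STARTING", "RUNNING", "WAITING"])["Clusters"]:
    print(summary["Id"], summary["Name"], summary["Status"]["State"])

emr.terminate_job_flows(JobFlowIds=["j-XXXXXXXXXXXXX"])  # placeholder cluster ID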


    More Big Data Options

The tutorials in this guide are just two examples of how you can use Amazon EMR to work with big data. This section describes other options that you might want to explore.

    Analyze Sentiment by Location in Twitter

We analyzed only the content of the tweets. You might want to collect geographic data, creating data sets for specific regions or ZIP codes in order to analyze sentiment by location.

For more information about the geographic aspects of Twitter data, see the Twitter API resource documentation, such as the description of the reverse_geocode resource.

    Explore Corpora and Data

The Natural Language Toolkit includes several corpora, ranging from Project Gutenberg selections to patient information leaflets. The Stanford Large Network Dataset Collection includes various types of sentiment data, as well as Amazon product review data. A wealth of movie review data is available from Cornell. You may find that working with a more focused set of data, rather than general Twitter data, yields more accurate results.

    Try Other Classifier Models

We applied a Naive Bayesian classifier to sentiment data. The Natural Language Toolkit supports several classification algorithms, including maxent (maximum entropy) and support vector machines (SVMs) through the scikitlearn module. Depending on the type of data you're analyzing and the features you choose to evaluate, other algorithms may produce better output. For more information, read Learning to Classify Text in the book Natural Language Processing with Python.

    Script Your Hive Queries

Interactively querying data is the most direct way to get results, and interactive queries can help you explore data and refine your approach. After you've created a set of queries that you want to run regularly, you can automate the process by saving your Hive commands as a script and uploading the script to Amazon S3. For more information on how to launch a cluster using a Hive script, see Launch a Hive Cluster in the Amazon Elastic MapReduce Developer Guide.

    Use Pig Instead of Hive to Analyze Your Data

Amazon EMR provides access to many open-source tools, including Pig, which uses a language called Pig Latin to abstract map-reduce jobs. For an example of how to analyze log files with Pig, see Parsing Logs with Apache Pig and Amazon Elastic MapReduce.

    Create Custom Applications to Analyze Your Data


If you're not able to find an open-source tool that meets your needs, you can write a custom Hadoop map-reduce application and run it on Amazon EMR. For more information, see Run a Hadoop Application to Process Data in the Amazon Elastic MapReduce Developer Guide.

Alternatively, you can create a Hadoop streaming job that reads data from standard input. For an example, see Tutorial: Sentiment Analysis (p. 5) in this guide. For more details, see Launch a Streaming Cluster in the Amazon Elastic MapReduce Developer Guide.


    Related Resources

The following list describes some of the AWS resources that you'll find useful as you work with AWS.

AWS Products & Services: Information about the products and services that AWS offers.

AWS Documentation: Official documentation for each AWS product, including service introductions, service features, and API reference.

AWS Discussion Forums: Community-based forums for discussing technical questions about Amazon Web Services.

Contact Us: A central contact point for account questions such as billing, events, and abuse. For technical questions, use the forums.

AWS Support Center: The hub for creating and managing your AWS Support cases. Also includes links to other helpful resources, such as forums, technical FAQs, service health status, and AWS Trusted Advisor.

AWS Support: The home page for AWS Support, a one-on-one, fast-response support channel to help you build and run applications in the cloud.

AWS Architecture Center: Provides the necessary guidance and best practices to build highly scalable and reliable applications in the AWS cloud. These resources help you understand the AWS platform, its services and features. They also provide architectural guidance for design and implementation of systems that run on the AWS infrastructure.

AWS Security Center: Provides information about security features and resources.

AWS Economics Center: Provides access to information, tools, and resources to compare the costs of Amazon Web Services with IT infrastructure alternatives.

AWS Technical Whitepapers: Provides technical whitepapers that cover topics such as architecture, security, and economics. These whitepapers have been written by the Amazon team, customers, and solution providers.


AWS Blogs: Provides blog posts that cover new services and updates to existing services.

AWS Podcast: Provides podcasts that cover new services, existing services, and tips.
