apache hadoop daniel lust, anthony taliercio. what is apache hadoop? allows applications to utilize...
TRANSCRIPT
Apache HadoopDaniel Lust , Anthony Taliercio
What is Apache Hadoop?
Allows applications to utilize thousands of nodes while exchanging thousands of terabytes of data to complete a task
Supports distributed applications under a free license
Used by many popular companiesSuch as: Facebook, Twitter, Ebay, IBM, Apple, Microsoft, Hewlett-Packard, and many others…
Continued…Written in Java
Scales wellCan be used with thousands of nodes
Can be used with just a few nodes and inexpensive hardware
Your average Hadoop cluster will consist of two major parts
A single master node and multiple working nodes. The master node is made up of four parts: the Job Tracker, Task Tracker, NameNode, and DataNode.
A worker node, which is also known as a slave node, can either be a DataNode and TaskTracker or just one of the two.
Overview Of Hadoop
- Hadoop uses whats called an HDFSHadoop Distributed File System
HDFS takes files and splits them across the network redundantly in a cluster
The redundancy to eliminate possible data loss
MapReduce
MapReduceSoftware wrote by google to process massive amounts of unstructured data in a parallel process across a distributed cluster of processors
MapReduce.
Offers a clean abstraction between data analysis tasks, organizing the jobs Issued by the HDFS, so no jobs are unnecessarily repeated.
- If one of them fail, a node may point to a different node to complete the task
Running Hadoop
First run of Hadoop on Master ComputerVarious processes are started including:
TaskTracker
JobTracker
DataNode
Secondary Node
NameNode It also makes a connection through SSH to other SLAVE computers to start a DataNode and TaskTracker
Running Hadoop
Used Hadoop to do a word count on six different books.
HDFS copied the books to different clusters, and ran a pre-written program to do a word count on the books.
Each node returned data, using the DataNode proccess to save its results.
When a node failed, it will issue the job to another node
Example Output of Job Processes
Word count Output
Tested on 1-3 Nodes
1 NODE: JOB COMPLETION 00:01:45
2 NODES: JOB COMPLETION
00:01:28
3 NODE : JOB COMPLETION
00:01:00
Conclusion
Our guide covered everything you need to get started with Apache Hadoop
Although, there are many problems you can see along the way
Troubleshooting was a large part of our project