Big Data and Hadoop

Data Facts:-

The New York Stock Exchange generates about 1 TB of trade data per day. Facebook hosts approximately 10 billion photos, taking up one petabyte of storage. Ancestry.com, the genealogy site, stores around 2.5 petabytes of data. Twitter generates about 8 TB of data per day. The Internet Archive stores around 2 petabytes of data and is growing at a rate of 20 terabytes per month. The Large Hadron Collider near Geneva, Switzerland, will produce about 15 petabytes of data per year.

Big Data:-

Big Data is commonly summarized as the 3Vs of data, though there is a fourth V that is equally important. They are as follows:-

Volume: - The total size of the data, which can run to terabytes, petabytes, or zettabytes and is often semi-structured or multi-structured.

Variety: - Most generated data is messy because diverse data sources do not provide a static structure, which makes it hard for a traditional RDBMS to manage the data in a timely way.

Velocity: - The speed at which data is collected, i.e. the rate at which data becomes available to the organization, together with the need to analyze streaming data so that decisions can be made within a very short time frame.

Veracity: - The uncertainty about the genuineness of the huge volume of data being generated.

Pic: - Different levels of data generation

Market trends are raising a new set of questions, such as:-

Social and Web Analytics:-

What is the social sentiment of my brand or products? How effective is our online campaign? How can I optimize my traffic to reach the target audience?

Live data feeds:-

How can we optimize the fleet based on weather and traffic patterns?

Advanced Analytics:-

How can we better predict our future outcomes?

Hadoop:-

Hadoop is a Big Data processing platform that uses the "MapReduce" processing paradigm. Characteristics:

i. Highly scalable (scales out).

ii. Commodity hardware-based.

iii. Open source, so acquisition and storage costs are very low.

Hadoop consists of two parts: the Hadoop Distributed File System (HDFS) and the MapReduce framework.

Hadoop Eco-System:-

HDFS Architecture:-

In HDFS, the NameNode receives all client requests to the file system and manages all the DataNodes in the cluster (DataNodes are the commodity machines that store the data and perform the computation). Incoming data is split into multiple blocks, which are spread evenly across the DataNodes; the NameNode keeps track of which DataNode holds which block. For high availability, each block is replicated according to the replication policy (the default factor is 3), i.e. every block is copied N times and stored on different DataNodes.
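The block splitting and replication described above can be sketched in a few lines of Python. This is only an illustration, not HDFS code: the function `plan_blocks`, the node names, and the 64 MB block size are assumptions made for the example, and the simple round-robin placement stands in for the rack-aware policy that real HDFS applies.

```python
import math

BLOCK_SIZE_MB = 64   # assumption for this sketch; the article does not state a block size
REPLICATION = 3      # default replication factor mentioned in the text


def plan_blocks(file_size_mb, datanodes,
                block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return {block_index: [datanode, ...]} with each replica on a different node."""
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    placement = {}
    for block in range(num_blocks):
        # Round-robin: shift the starting node for each block so that
        # replicas are spread evenly over the cluster.
        placement[block] = [datanodes[(block + r) % len(datanodes)]
                            for r in range(replication)]
    return placement


nodes = ["dn1", "dn2", "dn3", "dn4"]   # hypothetical DataNode names
plan = plan_blocks(200, nodes)         # a 200 MB file becomes 4 blocks
for block, replicas in plan.items():
    print(block, replicas)
```

With four DataNodes and a 200 MB file, every block ends up on three distinct nodes, so losing any single node still leaves two copies of each block.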

The Secondary NameNode stores a copy of the primary NameNode's metadata, so if the primary goes down, that copy can be used to bring up a replacement. Because automatic failover is not supported, the switch from Secondary NameNode to primary NameNode has to be made manually.

MapReduce Framework:-

MapReduce consists of multiple functions that are performed in sequence to arrive at the final result set. The diagram below depicts this flow:-

Pic:- Flow of MapReduce
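As a rough illustration of that flow, the word-count sketch below runs the map, shuffle, and reduce steps in a single Python process. The function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are made up for this example and are not part of the Hadoop API; a real job would distribute each phase across the DataNodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    # Keys are sorted before reducing so the result order is deterministic.
    return dict(reduce_phase(k, v) for k, v in sorted(grouped.items()))


counts = word_count(["big data and hadoop", "hadoop stores big data"])
print(counts)
# {'and': 1, 'big': 2, 'data': 2, 'hadoop': 2, 'stores': 1}
```

Because each call to `map_phase` and `reduce_phase` depends only on its own input, the same logic can be spread over many machines, which is what makes the paradigm scale out.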

Hadoop 1.x- In Summary:-

Limitation of Hadoop 1.x:-

No Horizontal scalability of NameNode:-

Challenges:-

i. Metadata is stored in NameNode memory, i.e. RAM.

ii. Bottleneck after ~4000 nodes.

iii. Results in cascading failures of DataNodes.
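To see why keeping all metadata in RAM limits scale, a back-of-the-envelope estimate helps. The sketch below assumes roughly 150 bytes of NameNode memory per file or block object; that figure is a commonly quoted rule of thumb, not a number from this article, and the function name `namenode_heap_bytes` is hypothetical.

```python
BYTES_PER_OBJECT = 150   # assumption: rough rule of thumb, not a figure from this article


def namenode_heap_bytes(num_files, blocks_per_file=1,
                        bytes_per_object=BYTES_PER_OBJECT):
    """Estimate NameNode RAM needed for file and block metadata (directories ignored)."""
    num_blocks = num_files * blocks_per_file
    return (num_files + num_blocks) * bytes_per_object


for files in (1_000_000, 100_000_000):
    gb = namenode_heap_bytes(files) / 1e9
    print(f"{files:>11,} files -> about {gb:.1f} GB of NameNode memory")
```

Under these assumptions, 100 million single-block files already need about 30 GB of memory on one machine, and since that machine cannot be scaled out, the namespace size is capped by its RAM.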

Does not support NameNode High Availability:-

Challenges:-

i. The Secondary NameNode is not a hot standby for the NameNode.

Overburdened JobTracker:-

Challenges:-

i. CPU spends a very significant portion of time and effort managing the life cycle of applications.

ii. Single Network Listener Thread to communicate with thousands of Map and Reduce jobs.

Not possible to run non-MapReduce Big Data applications on HDFS:-

Challenges:-

i. Only MapReduce processing can be achieved.

ii. Alternate data storage is needed for other processing such as real-time and graph analysis.

Does not support Multi-tenancy.

Hadoop 2.x:- Enhanced features are as follows-

HDFS Federation.

Support for NameNode High Availability.

YARN (Yet Another Resource Negotiator):

i. Better processing control.

ii. Support for non-MapReduce types of processing.

iii. Support for multi-tenancy.

Hadoop 2.x- In Summary:-

Pic:- Structure of Hadoop 2.x

Yet Another Resource Negotiator (YARN):- It makes it possible to run multiple types of workloads on the same cluster.

Multi-tenancy - Capacity Scheduler:-

Structure difference of Hadoop 1.x and 2.x:-
