hadoop overview 1

Hadoop Overview

2008.01.15 유현정

Hadoop

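• Brief History
  – 2005: started by Doug Cutting (developer of Lucene and Nutch)
    • Grew out of the distributed-scaling issues of Nutch, the open-source search engine
  – 2006: received full backing from Yahoo (which hired Doug Cutting and a dedicated team)
  – 2008: promoted to an Apache top-level project
  – Current release (as of 2009.1): 0.19.0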

Hadoop

• Written in Java
• Apache license
• Many components: HDFS, HBase, MapReduce, Hadoop On Demand (HOD), Streaming, HQL, Hama, etc.

Hadoop Architecture

Hadoop

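• Main functions
  – Distributed file system
  – Distributed computing
• Distributed File System (DFS)
  – A system that stores files in one large virtual space formed by pooling the storage of servers connected over a network
  – Designed to meet two requirements at once: handling very large data, such as index files built by analyzing the content of web pages worldwide, and processing an enormous volume of transactions
  – Examples
    • OwFS (Owner based File System), developed jointly by NHN and KAIST
    • NFS from Sun Microsystems
    • Microsoft's distributed file system
    • IBM's Transarc DFS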

Google File System(GFS)

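• Hadoop's DFS takes the basic concepts of the Google File System (GFS) as they are and implements them
• Characteristics of GFS
  – Uses generally inexpensive equipment such as PCs
  – Solves problems in software rather than with costly equipment such as NAS
  – Must be able to handle a large number of large files (hundreds of MB to several GB)
  – Makes no additional backups of the data
  – Machines must be easy to add and remove
  – Keeps providing service without a separate recovery procedure even when a particular node fails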


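• Two daemon servers (running at the user-application level)
  – GFS master: manages file metadata such as file name and size
  – GFS chunkserver: stores the actual file data
• A single file of hundreds of MB to several GB or more is split into multiple pieces, which are stored on several chunkservers
• Each piece of a split file is a chunk (default size: 64 MB)
• GFS client: processes files by communicating with the daemon servers
  – Performs operations such as creating, reading, and writing files
  – Provided as an API; internally it communicates with the servers over sockets or similar mechanisms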
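The chunking rule above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, assuming the 64 MB default from the slides; `split_into_chunks` is a hypothetical helper name, not part of GFS or Hadoop.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB default chunk size, as stated in the slides


def split_into_chunks(file_size, chunk_size=CHUNK_SIZE):
    """Return (offset, length) pairs that cover a file of file_size bytes.

    Hypothetical helper for illustration; real GFS chunk handling is more involved.
    """
    if file_size < 0 or chunk_size <= 0:
        raise ValueError("file_size must be >= 0 and chunk_size must be > 0")
    chunks = []
    offset = 0
    while offset < file_size:
        # The last chunk may be shorter than chunk_size.
        length = min(chunk_size, file_size - offset)
        chunks.append((offset, length))
        offset += length
    return chunks


MB = 1024 * 1024
layout = split_into_chunks(200 * MB)
print(len(layout))          # 4 chunks: 64 + 64 + 64 + 8 MB
print(layout[-1][1] // MB)  # 8
```

A 200 MB file therefore occupies four chunks, the last of which holds only the remaining 8 MB.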

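Google File System(GFS)
• How it works
  – 1. The application uses the API provided by the file system to create a GFS client and requests a file operation
  – 2. The GFS client asks the GFS master for information about the file
  – 3. The GFS master returns the information for the requested file from the file metadata it manages
    • The returned data includes the file size, the number of chunks the file is split into, the chunk size, and the addresses of the chunkservers that store each chunk
  – 4. The GFS client connects to the chunkserver that stores the relevant chunk and requests the file operation
  – 5. The GFS chunkserver performs the actual file operation
• The client and the master exchange only information about the file; the actual file data is moved and processed between the client and the chunkservers
• This keeps the load on the master to a minimum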
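The read path described in the five steps can be mimicked with a small in-memory model. This is a toy sketch, not the GFS protocol: the class names, the `lookup` and `read` methods, and the server addresses are all made up for illustration. The point it shows is that the master returns only metadata, while the bytes come from the chunkservers.

```python
class Master:
    """Holds metadata only: file name -> ordered list of (chunk_id, chunkserver address)."""

    def __init__(self):
        self.metadata = {}

    def lookup(self, filename):
        # Steps 2-3: return chunk locations, never file contents.
        return list(self.metadata[filename])


class ChunkServer:
    """Stores the actual chunk bytes, keyed by chunk id."""

    def __init__(self):
        self.chunks = {}

    def read(self, chunk_id):
        # Step 5: the chunkserver performs the actual data operation.
        return self.chunks[chunk_id]


class Client:
    def __init__(self, master, chunkservers):
        self.master = master
        self.chunkservers = chunkservers  # address -> ChunkServer

    def read_file(self, filename):
        locations = self.master.lookup(filename)
        parts = []
        for chunk_id, address in locations:
            # Step 4: contact the chunkserver that holds this chunk.
            parts.append(self.chunkservers[address].read(chunk_id))
        return b"".join(parts)


# Hypothetical two-server cluster holding one two-chunk file.
servers = {"cs1": ChunkServer(), "cs2": ChunkServer()}
servers["cs1"].chunks["c0"] = b"hello "
servers["cs2"].chunks["c1"] = b"world"
master = Master()
master.metadata["/demo.txt"] = [("c0", "cs1"), ("c1", "cs2")]
client = Client(master, servers)
print(client.read_file("/demo.txt"))  # b'hello world'
```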


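• Replication
  – One of the most important features
  – A chunk is not stored on a single chunkserver only; copies are stored on several chunkservers
    • Default: 3 copies
  – If a chunk has fewer copies than required, for example because a chunkserver went down, the master arranges for new copies to be made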
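The master's bookkeeping for under-replicated chunks might look roughly like the following. This is a simplified sketch under stated assumptions: the function name `rereplicate` and the "pick the first live server that lacks the chunk" policy are invented for illustration, and real placement decisions take far more into account.

```python
REPLICATION = 3  # default number of copies, as stated in the slides


def rereplicate(chunk_locations, live_servers, replication=REPLICATION):
    """Return updated chunk -> set-of-servers placement after server failures.

    chunk_locations: dict mapping chunk id to the set of servers holding a copy.
    live_servers: iterable of server names that are currently up.
    Hypothetical policy: copy to the alphabetically first live server lacking the chunk.
    """
    live = sorted(live_servers)
    result = {}
    for chunk_id, servers in chunk_locations.items():
        holders = set(servers) & set(live)  # drop replicas on dead servers
        candidates = [s for s in live if s not in holders]
        # A chunk with no surviving copy cannot be re-replicated.
        while holders and len(holders) < replication and candidates:
            holders.add(candidates.pop(0))
        result[chunk_id] = holders
    return result


placement = {"c1": {"A", "B", "C"}, "c2": {"C"}}
after = rereplicate(placement, live_servers={"A", "B", "D", "E"})
print(sorted(after["c1"]))  # ['A', 'B', 'D']
print(sorted(after["c2"]))  # []
```

In the example, server C has gone down: chunk `c1` gets a new copy on D, while `c2` had its only copy on C and is lost, which is why keeping several copies matters.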


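• Advantages of replication
  – Service continues without interruption even if a chunkserver holding a chunk goes down
  – Because copies always exist, the file system itself provides mirroring comparable to RAID 1
    • In practice, because Google does not use costly storage equipment such as NAS, it operates its systems at a far lower cost than competitors such as Yahoo
  – When reads concentrate on a particular file or chunk, the load that would fall on a single chunkserver can be spread out, not only across servers but also across physical devices such as disk heads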


• Drawbacks of replication
  – GFS creates the copies synchronously: a file creation is marked complete only after all copies have been fully stored. As a result, writes are by default three or more times slower than on an ordinary file system.
    • GFS may therefore be unsuitable for online real-time systems; it suits systems in which a file is created once and only read afterwards.
    • Typical services: search, video, images, mail (MIME files), etc.
  – Storing 3 copies requires 3 times the disk space
    • Because GFS can use the cheap disks attached to PCs, this is not really a drawback when compared with enterprise environments that use costly storage such as NAS.

Hadoop Distributed File System(HDFS)

• A pure Java file system
• Has much in common with existing distributed file systems, but also differs from them significantly
  – Highly fault-tolerant
  – Designed to be deployed on low-cost hardware
  – Provides high-throughput access to application data
  – Suitable for applications that have large data sets

Assumptions & Goals of HDFS

• 1. Hardware failure
  – An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system's data.
  – Because there are a huge number of components, and each component has a non-trivial probability of failure, some component of HDFS is always non-functional.
  – Detecting faults and recovering from them quickly and automatically is therefore a core architectural goal of HDFS.


• 2. Streaming data access
  – HDFS is suited to batch processing
  – HDFS is optimized to provide streaming read performance; this comes at the expense of random seek times to arbitrary positions in files.
  – HDFS relaxes a few POSIX requirements to enable streaming access to file system data.
  – Strength: high throughput of data access

* POSIX (Portable Operating System Interface): a set of standard operating system interfaces based on the Unix operating system


• 3. Large data sets
  – Applications that run on HDFS have large data sets
  – A typical file in HDFS is gigabytes to terabytes in size
  – HDFS is therefore tuned to support large files
    • It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster
    • It should support tens of millions of files in a single instance


• 4. Simple coherency model
  – HDFS applications need a write-once-read-many access model for files.
  – This assumption simplifies data coherency issues and enables high-throughput data access
  – A Map/Reduce application or a web crawler application fits perfectly with this model.
  – Support for appending writes is planned for the future
    • It was scheduled to be included in Hadoop 0.19 but is not available yet
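A write-once-read-many model can be illustrated with a toy in-memory store. This is a conceptual sketch only, with an invented `WriteOnceStore` class; it is not how HDFS enforces the rule internally.

```python
class WriteOnceStore:
    """Toy store in which a file can be created once and then only read."""

    def __init__(self):
        self._files = {}

    def create(self, name, data):
        # A second write to the same name is rejected, so readers never
        # have to worry about a file changing underneath them.
        if name in self._files:
            raise FileExistsError(name)
        self._files[name] = bytes(data)

    def read(self, name):
        return self._files[name]


store = WriteOnceStore()
store.create("/crawl/page-0001", b"<html>...</html>")
print(store.read("/crawl/page-0001"))  # b'<html>...</html>'

try:
    store.create("/crawl/page-0001", b"changed")
except FileExistsError:
    print("second write rejected")
```

Because content never changes after creation, any number of readers can work on the same file without coordination, which is what simplifies coherency.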


• 5. "Moving computation is cheaper than moving data"
  – A computation requested by an application is more efficient when it runs near the data it operates on
  – This is especially true when the data sets are very large
  – This minimizes network congestion and increases the overall throughput of the system
  – The assumption is that it is often better to migrate the computation closer to where the data is located than to move the data to where the application is running.
  – HDFS therefore provides interfaces that let applications move themselves closer to where the data is located
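The idea of moving computation to the data can be sketched as a placement decision: given the nodes that store a block and the nodes that are free to run a task, prefer one that already holds the block. The function `pick_node` and its return convention are assumptions made for this illustration, not a Hadoop API.

```python
def pick_node(block_replicas, free_nodes):
    """Choose where to run a task that reads one block.

    block_replicas: collection of node names that store the block.
    free_nodes: list of node names with spare capacity, in preference order.
    Returns (node, is_local); node is None if nothing is free.
    """
    for node in free_nodes:
        if node in block_replicas:
            return node, True          # data-local: no block transfer needed
    if free_nodes:
        return free_nodes[0], False    # remote: the block must cross the network
    return None, False


print(pick_node({"n2", "n5"}, ["n1", "n2", "n3"]))  # ('n2', True)
print(pick_node({"n2", "n5"}, ["n1", "n3"]))        # ('n1', False)
```

Only the second case pays the cost of shipping a block over the network, which is the cost the design tries to avoid for very large data sets.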

Assumptions & Goals of HDFS

• 6. Portability across heterogeneous hardware and software platforms
  – HDFS is designed to be easily portable from one platform to another
  – This facilitates widespread adoption of HDFS as a platform of choice for a large set of applications.

Features of HDFS

• A master/slave architecture

• A single NameNode, a master server for an HDFS cluster
  – Manages the file system namespace
    • Executes file system namespace operations such as opening, closing, and renaming files and directories
  – Regulates access to files by clients
  – Determines the mapping of blocks to DataNodes
-> Having a single NameNode simplifies the architecture of the system: the NameNode is the arbitrator and repository for all HDFS metadata

Features of HDFS

• A number of DataNodes, usually one per node in the cluster
  – Manage the storage attached to the nodes that they run on
    • Perform block creation, deletion, and replication upon instruction from the NameNode

Features of HDFS

• The NameNode and DataNodes are pieces of software designed to run on commodity machines.

• These machines typically run a GNU/Linux OS.

• Both are built using the Java language; any machine that supports Java can run the NameNode or DataNode software.

Features of HDFS

• A typical deployment has a dedicated machine that runs only the NameNode software.

• Each of the other machines in the cluster runs one instance of the DataNode software

• -> The architecture does not preclude running multiple DataNodes on the same machine but in a real deployment that is rarely the case.

Features of HDFS

• Single namespace for the entire cluster
  – Managed by a single NameNode
  – Files are write-once
  – Optimized for streaming reads of large files
• Files are broken into large blocks
  – HDFS is a block-structured file system
  – Default block size
    • In HDFS: 64 MB vs. in other systems: 4 or 8 KB
  – These blocks are stored in a set of DataNodes
• Client talks to both the NameNode and the DataNodes
  – Data is not sent through the NameNode

Features of HDFS

• DataNodes holding blocks of multiple files with a replication factor of 2
• The NameNode maps the filenames onto the block ids
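The two mappings mentioned here (file name to block ids, and block id to DataNodes) can be modeled with two dictionaries. The following is a toy sketch with a replication factor of 2; the `NameNode` class, its `add_file` method, and the round-robin placement are assumptions for illustration and do not reflect Hadoop's actual placement policy.

```python
class NameNode:
    """Toy metadata server: file name -> block ids, block id -> DataNodes."""

    def __init__(self, datanodes):
        self.datanodes = list(datanodes)
        self.file_blocks = {}      # filename -> [block_id, ...]
        self.block_locations = {}  # block_id -> [datanode, ...]
        self._next_block_id = 1
        self._next_node = 0

    def add_file(self, filename, num_blocks, replication=2):
        if replication > len(self.datanodes):
            raise ValueError("not enough DataNodes for this replication factor")
        block_ids = []
        for _ in range(num_blocks):
            block_id = self._next_block_id
            self._next_block_id += 1
            # Hypothetical round-robin placement on `replication` distinct nodes.
            nodes = [
                self.datanodes[(self._next_node + i) % len(self.datanodes)]
                for i in range(replication)
            ]
            self._next_node = (self._next_node + 1) % len(self.datanodes)
            self.block_locations[block_id] = nodes
            block_ids.append(block_id)
        self.file_blocks[filename] = block_ids
        return block_ids


nn = NameNode(["dn1", "dn2", "dn3"])
print(nn.add_file("/logs/a", num_blocks=2))  # [1, 2]
print(nn.block_locations[1])                 # ['dn1', 'dn2']
print(nn.block_locations[2])                 # ['dn2', 'dn3']
```

Every block ends up on two different DataNodes, and a client that knows only the file name can find all of them through the NameNode.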

Features of HDFS

• Built from a cluster of DataNodes
• Each server serves data blocks over the network
• Data can also be served over HTTP, so that all content can be accessed from a web browser or another client

Features of HDFS

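• Requires one special server called the NameNode
-> A single point of failure for an HDFS installation: if the NameNode goes down, the file system goes down as well
• A secondary NameNode is sometimes run
• Most installations use a single NameNode
• The replay process can take 30 minutes or more on a large cluster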

Features of HDFS

The File System Namespace

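• The file system namespace hierarchy is similar to that of other file systems:
  – Create and delete files
  – Move a file from one directory to another
  – Rename a file
• User quotas and access permissions are not implemented
• Hard links and soft links are not supported
• However, the architecture does not preclude implementing these features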

The File System Namespace

• The NameNode maintains the file system namespace
• Any change to the file system namespace or its properties is recorded by the NameNode
• An application can specify the number of replicas of a file that HDFS should maintain.
  – The number of copies of a file is called the replication factor of that file.
  – This information is also stored by the NameNode
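One way to picture "every namespace change is recorded by the NameNode" is an append-only log of operations that can be replayed to rebuild the namespace. The sketch below is a loose illustration under that assumption: the `Namespace` class, the operation names, and the log format are invented here and are not Hadoop's on-disk format.

```python
class Namespace:
    """Toy namespace: path -> replication factor, plus a log of every change."""

    def __init__(self):
        self.files = {}
        self.edit_log = []

    def _apply(self, op):
        kind = op[0]
        if kind == "create":
            _, path, replication = op
            self.files[path] = replication
        elif kind == "delete":
            del self.files[op[1]]
        elif kind == "rename":
            _, src, dst = op
            self.files[dst] = self.files.pop(src)
        elif kind == "set_replication":
            _, path, replication = op
            if path not in self.files:
                raise KeyError(path)
            self.files[path] = replication
        else:
            raise ValueError("unknown operation: %r" % (kind,))

    def execute(self, *op):
        # Apply the change, then record it so it can be replayed later.
        self._apply(op)
        self.edit_log.append(op)

    @classmethod
    def replay(cls, edit_log):
        # Rebuild the namespace from scratch by re-applying every logged change.
        ns = cls()
        for op in edit_log:
            ns._apply(op)
        ns.edit_log = list(edit_log)
        return ns


ns = Namespace()
ns.execute("create", "/a", 3)
ns.execute("rename", "/a", "/b")
ns.execute("set_replication", "/b", 2)
print(ns.files)                             # {'/b': 2}
print(Namespace.replay(ns.edit_log).files)  # {'/b': 2}
```

Replaying a long log takes time proportional to the number of recorded changes, which gives some intuition for why restarting the metadata server of a large cluster is slow.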
