jinseog kim dep. of applied statistics, dongguk university...

R: Software for Statistics and Big Data Analysis

Jinseog Kim, Dept. of Applied Statistics, Dongguk University

Email: [email protected]


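Downloading and Installing R

Introduction to R

Developed in 1995 by Robert Gentleman and Ross Ihaka (University of Auckland, New Zealand)

Based on the S language, which was developed at AT&T Bell Laboratories in the mid-1970s

Free, open-source software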



Installing R

http://www.r-project.org/

Figure : R homepage



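Running R

Double-click the R icon on the desktop.

Type a command after the ">" command prompt.

Press Enter to execute the command.

If the command is not yet complete, a "+" prompt appears and R waits for the rest of the command.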



Installing RStudio

An integrated development environment (IDE) for R: http://www.rstudio.com/


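R objects

Basic R terminology and utilities

Object: in R, data, functions, operators, and so on are all objects, and they are stored in memory

R workspace: the collection of objects created during a session

ls(): lists the objects in the workspace

ls()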

## character(0)

rm(): removes R objects

x <- 1

y <- 1:10

ls()

## [1] "x" "y"

rm(x,y)

ls()

## character(0)


help() prints the documentation for an R object; ?name can be used in place of help().

help(ls)

?ls

Checking and changing the working directory

getwd()

## [1] "D:/Dropbox/bigdata/lectures"

#setwd("D:/share/lectures/R-note")

The current workspace is saved with save.image(), which creates a file named .RData in the working directory.

save.image()


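R packages: R's functionality is extended by installing additional packages.

search() lists the packages currently attached to the session (the search path), not every installed package.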

search()

## [1] ".GlobalEnv" "package:knitr" "package:stats"

## [4] "package:graphics" "package:grDevices" "package:utils"

## [7] "package:datasets" "package:methods" "Autoloads"

## [10] "package:base"


library(): with no arguments, lists all installed packages with their descriptions

library()

library(package_name) loads an installed package into the current R session

library(MASS)


install.packages(): installs a new package into R

install.packages("stringr")

help(), ?: documentation for functions and objects

help("ls")

?ls


Kinds of R objects

R provides the following kinds of objects:

atomic (scalar constant)

vector

matrix

list

data.frame (data frame)

function

operator ...

Among these, the objects used to store data (atomic, vector, matrix, data.frame) are called data objects.


Data object types (typeof)

double (real numbers)

integer

character

logical

complex (complex numbers)


Data object modes (mode)

numeric: double and integer

character

logical

complex


Data object classes (class)

numeric (for double values)

integer

character

logical

factor: represents categorical data such as blood type or sex

matrix

list

data.frame


Data object examples

double / integer

typeof(10L);mode(10L)

## [1] "integer"

## [1] "numeric"

typeof(10);mode(10)

## [1] "double"

## [1] "numeric"

character


typeof("Hello World"); mode("Hello World")

## [1] "character"

## [1] "character"

logical

typeof(2 < 4); mode(2 < 4)

## [1] "logical"

## [1] "logical"
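
The three views of a data object (type, mode, class) can be compared side by side. The following is a minimal sketch; the helper name describe() is made up for this illustration and is not part of base R.

```r
# Hypothetical helper (not part of base R): report type, mode and class together
describe <- function(x) {
  c(typeof = typeof(x), mode = mode(x), class = class(x)[1])
}

describe(10L)            # integer literal
describe(10)             # double
describe("Hello World")  # character string
describe(2 < 4)          # logical
describe(factor("A"))    # factor: integer codes underneath, class "factor"
```

For a factor the type is "integer" and the mode is "numeric", while the class is "factor"; this is why class() is usually the most informative of the three for categorical data.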


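Data structures

Vectors

A vector is a data object made up of one or more elements

All elements of a vector must have the same data type; c(1, 2, "a", "b") therefore does not keep the numbers as numbers but coerces every element to character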

x1 <- c(1,2,3,4)

x2 <- 1:3

x3 <- c("A", "B", "C")

y <- c(x1, 0, x2) # 1,2,3,4, 0, 1,2,3

c(...) creates or combines vectors

: (colon) is the operator that generates a vector of consecutive integers
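
The same-type rule can be checked directly. The sketch below (variable names v, w, and l are only illustrative) shows that c() does not reject mixed input but silently converts it to a common type, whereas a list keeps each element's own type.

```r
# Mixing numbers and strings does not fail; every element becomes character
v <- c(1, 2, "a", "b")
typeof(v)        # "character"

# Mixing logical and numeric values coerces the logicals to numbers
w <- c(TRUE, FALSE, 3)
typeof(w)        # "double"

# A list, by contrast, keeps each element's own type
l <- list(1, 2, "a", "b")
sapply(l, typeof)
```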



Vectors (2)

The following functions also create vectors.

rep: repetition

rep(2, 10)

## [1] 2 2 2 2 2 2 2 2 2 2

rep(c(1,2), each=5)

## [1] 1 1 1 1 1 2 2 2 2 2

seq: generates an arithmetic sequence

seq(0, 1, length.out=11)

## [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0

A vector of numbers from 1 to 9 in steps of 2

seq(1, 9, by = 2)

## [1] 1 3 5 7 9

numeric, double, integer, character: create a vector of the named type with the length given in parentheses, filled with 0 (or "" for character)

integer(length = 10)

## [1] 0 0 0 0 0 0 0 0 0 0



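Vector classes

numeric: continuous

factor: categorical

ordered: ordered categorical

R code               mode(x)     class(x)
x <- c(1:10)         "numeric"   "integer"
x <- factor(1:10)    "numeric"   "factor"
x <- ordered(1:10)   "numeric"   "ordered" "factor"

Note: 1:10 is stored as integers, so its class is "integer"; a vector with non-integer values such as c(1.5, 2) has class "numeric".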



Indexing vectors

Components of an R data object are accessed by index or by component name, as follows.

object[ arg1, ... , argn ] # for vector, matrix, array

object[[ arg1, ... , argn ]] # for list

object$tag # for data.frame or named list
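
The three access forms above can be tried on small objects. This is a minimal sketch; the objects x, m, and lst are invented for the illustration.

```r
x <- c(a = 10, b = 20, c = 30)            # named numeric vector
m <- matrix(1:6, nrow = 2)                # 2 x 3 matrix, filled column by column
lst <- list(id = 1, tags = c("u", "v"))   # named list

x[2]           # by position: the element named "b"
x["c"]         # by name
m[2, 3]        # row 2, column 3
m[, 1]         # the whole first column
lst[["tags"]]  # the component itself (a character vector)
lst$id         # the same idea using the $ shortcut
```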



Matrices

Using the matrix() function

X1 <- matrix(1:10, nrow=2, ncol=5); X1

## [,1] [,2] [,3] [,4] [,5]

## [1,] 1 3 5 7 9

## [2,] 2 4 6 8 10

Creating a diagonal matrix with diag()

X2 <- diag(1, 5); X2

## [,1] [,2] [,3] [,4] [,5]

## [1,] 1 0 0 0 0

## [2,] 0 1 0 0 0

## [3,] 0 0 1 0 0

## [4,] 0 0 0 1 0

## [5,] 0 0 0 0 1

X2 <- diag(10)

X2 <- diag(1:10)

X2 <- diag(c(1,3,5,7,9))



Combining matrices and vectors

Column-wise binding

x <- c(1,2,3); y <- c(4,5,6)

cbind(x,y)

## x y

## [1,] 1 4

## [2,] 2 5

## [3,] 3 6

Row-wise binding

rbind(x,y)

## [,1] [,2] [,3]

## x 1 2 3

## y 4 5 6



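Lists

A list is an object whose components may be R objects of different kinds

Any R object can be a list component: numeric vectors, logical values, matrices, character strings, arrays, functions, and so on.

Creating lists

Use list()

list(name_1=object_1, ..., name_m=object_m)

Here name_1, ..., name_m are the component names and object_1, ..., object_m are the component values.

Example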

Lst <- list(name="fred", wife="mary", child.ages=c(4,7,9))

Lst

## $name

## [1] "fred"

##

## $wife

## [1] "mary"

##

## $child.ages

## [1] 4 7 9



Accessing components

[[ ]]

Lst[[1]]

## [1] "fred"

When the component has a name

Lst[["name"]]  # or Lst$name

## [1] "fred"

Sub-list

Lst[2:3]

## $wife

## [1] "mary"

##

## $child.ages

## [1] 4 7 9

Number of components: length()

length(Lst)

## [1] 3
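
The difference between single and double brackets is easy to miss, so here is a small sketch using the same Lst as above (the names a and b are only illustrative): single brackets return a sub-list, double brackets return the component itself.

```r
Lst <- list(name = "fred", wife = "mary", child.ages = c(4, 7, 9))

a <- Lst["child.ages"]     # single brackets: a list of length 1
b <- Lst[["child.ages"]]   # double brackets: the numeric vector itself

class(a)   # "list"
class(b)   # "numeric"
mean(b)    # arithmetic works on the extracted vector, not on the sub-list
```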



Combining lists

c(): works the same way as when creating or combining vectors

list1 <- list(a1=1, b1=1:3)

list2 <- list(a2=c("Kim", "Park"))

c(list1, list2)

## $a1

## [1] 1

##

## $b1

## [1] 1 2 3

##

## $a2

## [1] "Kim" "Park"



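Data frames

A data frame is a list with the following properties:

Its components may be vectors, factors, matrices, lists, or other data frames

Matrices, lists, and data frames contribute as many variables to the new data frame as they have columns, components, or variables, respectively

Vectors (numeric, character, and so on) become columns of the data frame

All variables (columns) in a data frame have the same length

Creating a data frame

data.frame()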

name <- c("kim","lee","park","Oh")

sex <- c('f','m','f','m')

income <- c(100,102,300,204)

d1 <- data.frame(name=name, gender=sex, incom=income)

d1

## name gender incom

## 1 kim f 100

## 2 lee m 102

## 3 park f 300

## 4 Oh m 204

as.data.frame(): converts a list or matrix to a data frame



Data frame functions

Viewing the first rows

head(d1, 2)

## name gender incom

## 1 kim f 100

## 2 lee m 102

Printing variable names

names(d1)

## [1] "name" "gender" "incom"

Printing dimensions

nrow(d1) # number of rows

## [1] 4

ncol(d1) # number of columns

## [1] 3

dim(d1) # row and column dimension

## [1] 4 3



Type conversion functions

as.numeric()

as.character()

as.matrix()

as.data.frame()

unlist()
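
A short sketch of these conversion functions in use; all object names (s, n, m, d, l) are invented for the illustration.

```r
s <- c("1", "2.5", "10")
n <- as.numeric(s)          # character -> numeric
as.character(n)             # numeric -> character

m <- matrix(1:4, nrow = 2)
d <- as.data.frame(m)       # matrix -> data frame (columns named V1, V2)
as.matrix(d)                # data frame -> matrix

l <- list(a = 1, b = 2:3)
unlist(l)                   # list -> named vector with names a, b1, b2
```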


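Reading data from files

External text files

The external file should have the following format:

The first line of the file gives the variable names

Observations are entered in the same order as the variable names

Example: an external file (titanic.txt) written in this format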

Surv N Class Age Sex

20 23 Crew Adult Female

192 862 Crew Adult Male

1 1 First Child Female

5 5 First Child Male

13 13 Second Child Female

...



The read.table() function

Read the example data into a data frame (titanic)

titanic <- read.table("data/titanic.txt", header=TRUE)

head(titanic)

## Surv N Class Age Sex

## 1 20 23 Crew Adult Female

## 2 192 862 Crew Adult Male

## 3 1 1 First Child Female

## 4 5 5 First Child Male

## 5 140 144 First Adult Female

## 6 57 175 First Adult Male



The read.csv() function

If the original data are edited in Excel, they can be loaded into an R data.frame as follows.

Save the file in CSV (comma-separated values) format: File ⇒ Save As

Then use one of the calls below.

my.table <- read.csv(file.choose()) ## choose the file in a dialog box

my.table <- read.csv("c:/xfile.csv") ## give the file name directly
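
To try read.csv() without an Excel file at hand, a CSV file can be written and read back. The sketch below assumes nothing beyond base R; the data frame and the temporary file path are made up for the illustration.

```r
# Write a small data frame to a temporary CSV file, then read it back
d1 <- data.frame(name = c("kim", "lee"), income = c(100, 102))
csv_path <- tempfile(fileext = ".csv")   # illustrative temporary file
write.csv(d1, csv_path, row.names = FALSE)

d2 <- read.csv(csv_path, header = TRUE)  # header = TRUE is the default for read.csv
str(d2)
unlink(csv_path)                         # clean up the temporary file
```

Passing row.names = FALSE to write.csv() keeps the row numbers from being written as an extra unnamed column.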



The scan() function

Suppose a text data file ('input.dat') contains the following values.

52.00 54.75 57.50

57.50 59.75 111.0

128.0 101.0 131.0 93.0

Use the scan() function to read such a file.

inp <- scan("input.dat")
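
The example can be reproduced without the original file by writing the same three lines to a temporary file first. This is a sketch; the temporary path stands in for 'input.dat'.

```r
# Recreate the example file in a temporary location (the path is illustrative)
dat_path <- tempfile(fileext = ".dat")
writeLines(c("52.00 54.75 57.50",
             "57.50 59.75 111.0",
             "128.0 101.0 131.0 93.0"), dat_path)

inp <- scan(dat_path)   # reads all whitespace-separated numbers into one vector
length(inp)             # 10 values, regardless of how many were on each line
unlink(dat_path)
```

Unlike read.table(), scan() does not require every line to hold the same number of values, which is why it suits ragged files like this one.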



The edit() function

To modify existing data (olddata):

newdata <- edit(olddata)

To enter new data:

xnew <- edit(data.frame())



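Accessing Excel files with RODBC

To connect to Excel files from R, RODBC provides odbcConnectExcel and odbcConnectExcel2007, depending on the version of the Excel file (both are available on Windows only).

odbcConnectExcel(xls.file, readOnly = TRUE, ...)

odbcConnectExcel2007(xls.file, readOnly = TRUE, ...)

For example, the following code connects to the file API2.xls in W:/data/.

library(RODBC)

con <- odbcConnectExcel("W:/data/API2.xls", readOnly=FALSE)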



ODBC: checking Excel file information

> odbcGetInfo(con)

DBMS_Name DBMS_Ver Driver_ODBC_Ver Data_Source_Name Driver_Name Driver_Ver

"EXCEL" "08.00.0000" "03.51" "" "odbcjt32.dll" "04.00.6305"

ODBC_Ver Server_Name

"03.52.0000" "EXCEL"

> tbls=sqlTables(con)

TABLE_CAT TABLE_SCHEM TABLE_NAME TABLE_TYPE REMARKS

1 W:\\data\\API2 <NA> API$ SYSTEM TABLE <NA>

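The sqlTables function shows information about the connected Excel file; in particular, the TABLE_NAME field gives the names of the worksheets in the file. Note that although the worksheet is actually named API in Excel, sqlTables reports it as API$, and SQL queries must refer to it as [API$].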



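ODBC: retrieving data with SQL

sqlQuery: the RODBC function for reading a worksheet of an Excel file; its argument is an SQL statement

The SQL statement determines which data are extracted for the analysis.

The result of sqlQuery is returned as an R data frame, which can then be used for further analysis in R.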



ODBC: querying with sqlQuery

> a=sqlQuery(con, "select * from [API$]", as.is=T)

> head(a)

id type name region api100 api99 diff nstud

1 01611190130229 H Alameda High Alameda 731 693 38 1090

2 01611190132878 H Encinal High Alameda 622 589 33 840

3 01611196000004 M Chipman Middle Alameda 622 572 50 472

4 01611196090005 E Lum (Donald D.) Alameda 774 732 42 272

5 01611196090013 E Edison Elementa Alameda 811 784 27 216

6 01611196090021 E Otis (Frank) El Alameda 780 725 55 247

> b=sqlQuery(con, "select id,region, api100,api99 from [API$] where type='H'", as.is=T)

> head(b)

id region api100 api99

1 01611190130229 Alameda 731 693

2 01611190132878 Alameda 622 589

3 01611270130450 Alameda 789 773

4 01611430131177 Alameda 716 728

5 01611500132225 Alameda 741 723

6 01611680132746 Alameda 491 443



ODBC: updating data

In addition, if the Excel file was opened with the option readOnly=F, SQL update statements can be used, as shown below.

> sqlQuery(con, "update [API$] set type='H' where id='01611190130229'")



XLConnect

XLConnect is a package for handling Microsoft Excel data from R, and it can be used on various operating systems.

library(XLConnect)

df <- readWorksheetFromFile("<file name and extension>",

sheet=1,

startRow = 4,

endCol = 2)

wb <- loadWorkbook("<name and extension of your file>")

df <- readWorksheet(wb, sheet=1)

sheet: sheet name or index

startRow/startCol: first row/column of the range to import

endRow/endCol: last row/column of the range to import

region: cell range (e.g. A5:B5)



XLConnect (2)

# Excel file (demo file shipped with the XLConnect package)

demoExcelFile <- system.file("demoFiles/mtcars.xlsx",

package = "XLConnect")

# Load the Excel file

wb <- loadWorkbook(demoExcelFile)

# Read the data from the 'mtcars' sheet of the workbook

dt <- readWorksheet(wb, sheet = "mtcars")

head(dt)

## mpg cyl disp hp drat wt qsec vs am gear carb

## 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4

## 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4

## 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1

## 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1

## 5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2

## 6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1



XLConnect (3)

A program that uses XLConnect to save the iris data.frame to a separate Excel worksheet for each species

# Load workbook (create if not existing)
wb <- loadWorkbook("iris.xlsx", create = TRUE)

Species <- as.character(unique(iris$Species))

for (sp in Species) {
  # Create worksheet
  createSheet(wb, name = sp)
  # Write data to worksheet (held in memory only; not yet written to the file)
  writeWorksheet(wb, iris[iris$Species == sp, ],
                 sheet = sp, header = TRUE)
}

# The call below writes the workbook to the file
saveWorkbook(wb)


R Operators and Built-in Functions

R operators

Operator      Description
-, +, *, /    Minus, plus, multiplication, division
%%            Modulus (remainder)
%/%           Integer division (quotient)
<             Less than
>             Greater than
==            Equal to
>=            Greater than or equal to
<=            Less than or equal to
!             Unary not
^             Exponentiation
&             And, vectorized
&&            And (single values, not vectorized)
|             Or, vectorized
||            Or (single values, not vectorized)
<-            Left assignment
=             Left assignment
->            Right assignment
<<-           Global assignment (assigns to a variable outside the function)



R operator examples

x <- c(1, 10, 13, 3)

x %% 2

## [1] 1 0 1 1

x%/% 3

## [1] 0 3 4 1

x > 3

## [1] FALSE TRUE TRUE FALSE

y <- c(3, 5, 2, 1)

x>y

## [1] FALSE TRUE TRUE TRUE

z <- TRUE

!z

## [1] FALSE



R operator examples (2)

x1 <- x%%2; x1

## [1] 1 0 1 1

y1 <- y%%2; y1

## [1] 1 1 0 1

x1 | y1

## [1] TRUE TRUE TRUE TRUE
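
The vectorized operators (&, |) and the single-value operators (&&, ||) serve different purposes. The sketch below reuses the values of x1 and y1 from the example; the variable n is introduced only for the illustration.

```r
x1 <- c(1, 0, 1, 1)
y1 <- c(1, 1, 0, 1)

# Non-zero numbers are treated as TRUE, zero as FALSE
x1 & y1            # element-wise AND: TRUE FALSE FALSE TRUE
x1 | y1            # element-wise OR:  TRUE TRUE TRUE TRUE

# && and || expect single values, which is what if() needs
n <- length(x1)
if (n > 0 && x1[1] == 1) {
  message("first element is 1")
}
```

Because && stops evaluating as soon as the left side is FALSE, a guard such as n > 0 can safely precede an expression that would otherwise be meaningless for an empty vector.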



Built-in functions

Function                R function
Square root             sqrt
Exponential             exp
Logarithm               log(5), log2(5), log10(5), log(5, base=3)
Maximum                 max, pmax
Minimum                 min, pmin
Sum                     sum
Mean                    mean
Absolute value          abs
Cumulative operations   cummax, cummin, cumprod, cumsum
Trigonometric           sin, cos, tan
Rounding                ceiling, round, trunc, floor



R built-in functions

a <- 1:5

sqrt(a)

## [1] 1.000000 1.414214 1.732051 2.000000 2.236068

exp(a)

## [1] 2.718282 7.389056 20.085537 54.598150 148.413159

out <- (a + sqrt(a))/(exp(2)+1); out

## [1] 0.2384058 0.4069842 0.5640743 0.7152175 0.8625604

x1 <- seq(-2, 4, by = .5); x1

## [1] -2.0 -1.5 -1.0 -0.5 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0

floor(x1)

## [1] -2 -2 -1 -1 0 0 1 1 2 2 3 3 4

trunc(x1)

## [1] -2 -1 -1 0 0 0 1 1 2 2 3 3 4



R built-in functions

a <- c(1,-2,3,-4)

b <- c(-1,2,-3,4)

min(a,b)

## [1] -4

pmin(a,b)

## [1] -1 -2 -3 -4



Other built-in functions

print(): Prints a single R object

a <- c(5,3,6,2,4)

print(a)

## [1] 5 3 6 2 4

cat(): Prints multiple objects, one after the other

cat("mean of a is ",mean(a), "variance of a is ", var(a),"\n")

## mean of a is 4 variance of a is 2.5

unique():Gives the vector of distinct values

x <- c(1,5,1,3,5,7,5)

unique(x)

## [1] 1 5 3 7



Other built-in functions

diff(): Replace a vector by the vector of first differences

diff(x)

## [1] 4 -4 2 2 2 -2

sort(): Sort elements into order, but omitting NAs

order(): x[order(x)] orders elements of x, with NAs last

rev(): reverse the order of vector elements

print(x)

## [1] 1 5 1 3 5 7 5

sort(x)

## [1] 1 1 3 5 5 5 7

order(x)

## [1] 1 3 4 2 5 7 6

rev(x)

## [1] 5 7 5 3 1 5 1


R Data Handling

Indexing

x <- sample(1:10, 15, replace=TRUE)

x

## [1] 1 9 3 1 2 4 5 1 7 6 8 4 1 5 8

others <- (x > 1)

others

## [1] FALSE TRUE TRUE FALSE TRUE TRUE TRUE FALSE TRUE TRUE TRUE

## [12] TRUE FALSE TRUE TRUE

x[others]

## [1] 9 3 2 4 5 7 6 8 4 5 8

ind <- which(x > 1)

ind

## [1] 2 3 5 6 7 9 10 11 12 14 15

x[ind]

## [1] 9 3 2 4 5 7 6 8 4 5 8

x[!others]

## [1] 1 1 1 1

x[-ind]

## [1] 1 1 1 1



subscripting (extracting part of the data)

USArrests data: this data set contains statistics, in arrests per 100,000 residents, for assault, murder, and rape in each of the 50 US states in 1973. Also given is the percent of the population living in urban areas.

A data frame with 50 observations on 4 variables.

    Variable   Type      Description
1   Murder     numeric   Murder arrests (per 100,000)
2   Assault    numeric   Assault arrests (per 100,000)
3   UrbanPop   numeric   Percent urban population
4   Rape       numeric   Rape arrests (per 100,000)

head(USArrests,3)

## Murder Assault UrbanPop Rape

## Alabama 13.2 236 58 21.2

## Alaska 10.0 263 48 44.5

## Arizona 8.1 294 80 31.0



subscripting

Numeric subscripts

# Top 5 states with high murder rate

nidx <- order(USArrests$Murder, decreasing=T)[1:5]

nidx

## [1] 10 24 9 18 40

USArrests[nidx,]

## Murder Assault UrbanPop Rape

## Georgia 17.4 211 60 25.8

## Mississippi 16.1 259 44 17.1

## Florida 15.4 335 80 31.9

## Louisiana 15.4 249 66 22.2

## South Carolina 14.4 279 48 22.5



subscripting

Logical subscripts

lidx <- (USArrests$Murder

< quantile(USArrests$Murder, 0.1))

head(lidx, 10)

## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

USArrests[lidx,]

## Murder Assault UrbanPop Rape

## Iowa 2.2 56 57 11.3

## Maine 2.1 83 51 7.8

## New Hampshire 2.1 57 56 9.5

## North Dakota 0.8 45 44 7.3

## Vermont 2.2 48 32 11.2



subset

The subset() function

subset(USArrests, UrbanPop > 85)

## Murder Assault UrbanPop Rape

## California 9.0 276 91 40.6

## New Jersey 7.4 159 89 18.8

## New York 11.1 254 86 26.1

## Rhode Island 3.4 174 87 8.3

subset(USArrests, UrbanPop < 40 & Murder < 10,

select = c(Assault, Rape))

## Assault Rape

## Vermont 48 11.2

## West Virginia 81 9.3



Combining data

authors

## surname nationality

## 1 Tukey US

## 2 Venables Australia

## 3 Tierney US

## 4 Ripley UK

## 5 McNeil Australia

books

## name title

## 1 Tukey Exploratory Data Analysis

## 2 Venables Modern Applied Statistics ...

## 3 Tierney LISP-STAT

## 4 Ripley Spatial Statistics

## 5 Ripley Stochastic Simulation

## 6 McNeil Interactive Data Analysis

## 7 R Core An Introduction to R



merge(): combining data (2)

Join the two data frames using "surname" in authors and "name" in books as the key

m1 <- merge(authors, books, by.x = "surname", by.y = "name")

m1

## surname nationality title

## 1 McNeil Australia Interactive Data Analysis

## 2 Ripley UK Spatial Statistics

## 3 Ripley UK Stochastic Simulation

## 4 Tierney US LISP-STAT

## 5 Tukey US Exploratory Data Analysis

## 6 Venables Australia Modern Applied Statistics ...
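
The slides show authors and books only as printed output, so the sketch below first rebuilds them with data.frame() (an assumption about how they were created) and then contrasts the default inner join with a full outer join via all = TRUE. The name m2 is introduced for the illustration.

```r
authors <- data.frame(
  surname     = c("Tukey", "Venables", "Tierney", "Ripley", "McNeil"),
  nationality = c("US", "Australia", "US", "UK", "Australia"))
books <- data.frame(
  name  = c("Tukey", "Venables", "Tierney", "Ripley", "Ripley", "McNeil", "R Core"),
  title = c("Exploratory Data Analysis", "Modern Applied Statistics ...",
            "LISP-STAT", "Spatial Statistics", "Stochastic Simulation",
            "Interactive Data Analysis", "An Introduction to R"))

# Inner join (default): "R Core" has no match in authors and is dropped
m1 <- merge(authors, books, by.x = "surname", by.y = "name")
nrow(m1)   # 6

# Full outer join: unmatched rows are kept, with NA on the missing side
m2 <- merge(authors, books, by.x = "surname", by.y = "name", all = TRUE)
m2[is.na(m2$nationality), ]
```

Using all.x = TRUE or all.y = TRUE instead of all = TRUE keeps unmatched rows from only one of the two inputs.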



aggregate(): summarizing data

Splits the data into subsets and computes summary statistics for each

aggregate(variables to summarize, list(grouping variables), summary function)

aggregate(x = testDF, by = list(fby1, fby2), FUN = "mean")

aggregate(Sepal.Length~Species, data=iris, FUN=mean)

## Species Sepal.Length

## 1 setosa 5.006

## 2 versicolor 5.936

## 3 virginica 6.588



apply(): summarizing data

Apply Functions Over Array Margins

apply(array, MARGIN, FUN, ...)

apply(iris[, 1:4], 2, mean)

## Sepal.Length Sepal.Width Petal.Length Petal.Width

## 5.843333 3.057333 3.758000 1.199333

apply(iris[, 1:4], 1, sum)[1:10]

## [1] 10.2 9.5 9.4 9.4 10.2 11.4 9.7 10.1 8.9 9.6



lapply(): summarizing data by applying a function to each element

Apply a Function over a List or Vector

lapply(vector or list, FUN, ...)

lapply(iris[1:4], mean) # lapply(iris[,1:4], mean)

## $Sepal.Length

## [1] 5.843333

##

## $Sepal.Width

## [1] 3.057333

##

## $Petal.Length

## [1] 3.758

##

## $Petal.Width

## [1] 1.199333



lapply(): summarizing data (2)

lapply(1:4, function(i) mean(iris[,i]))

## [[1]]

## [1] 5.843333

##

## [[2]]

## [1] 3.057333

##

## [[3]]

## [1] 3.758

##

## [[4]]

## [1] 1.199333
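
Since lapply() always returns a list, the result is often flattened afterwards. A minimal sketch, with res_list and res_vec as illustrative names, using unlist() from the conversion functions and comparing with sapply(), which simplifies the result itself:

```r
res_list <- lapply(iris[1:4], mean)   # list of four column means
res_vec  <- unlist(res_list)          # flatten to a named numeric vector
res_vec

# sapply() performs the lapply() call and the simplification in one step
all.equal(res_vec, sapply(iris[1:4], mean))
```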



Creating a data.table object

data.table(...)

library(data.table)

DT <- data.table(x=c("b","b","a","a"),v=rnorm(4))

DT

## x v

## 1: b 1.1902136

## 2: b -1.4447011

## 3: a -0.3003707

## 4: a -0.2358570



Creating a data.table object from a data.frame

CARS <- data.table(cars)

head(CARS)

## speed dist

## 1: 4 2

## 2: 4 10

## 3: 7 4

## 4: 7 22

## 5: 8 16

## 6: 9 10



Listing data.tables

tables()

## NAME NROW NCOL MB COLS KEY

## [1,] CARS 50 2 1 speed,dist

## [2,] DT 4 2 1 x,v

## Total: 2MB



Group summary

Iris <- data.table(iris)

names(Iris)

## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"

## [5] "Species"

Iris[, mean(Petal.Width), by="Species"]

## Species V1

## 1: setosa 0.246

## 2: versicolor 1.326

## 3: virginica 2.026

Iris[,lapply(.SD, mean),by=Species]

## Species Sepal.Length Sepal.Width Petal.Length Petal.Width

## 1: setosa 5.006 3.428 1.462 0.246

## 2: versicolor 5.936 2.770 4.260 1.326

## 3: virginica 6.588 2.974 5.552 2.026

tapply(iris$Petal.Width, iris$Species, mean)

## setosa versicolor virginica

## 0.246 1.326 2.026



The sqldf package

sqldf: manipulates R data frames using SQL syntax

## Load the package

library(sqldf)

# Use the iris data set

sqldf('select count(*) `N`,

AVG("Sepal.Width") `Sepal.Width`

from iris group by Species')

## N Sepal.Width

## 1 50 3.428

## 2 50 2.770

## 3 50 2.974



sqldf

system.time({a8r <- aggregate(iris[1:2], iris[5], mean)

})

## user system elapsed

## 0 0 0

system.time({a8s <- sqldf('select Species,

avg("Sepal.Length") `Sepal.Length`,

avg("Sepal.Width") `Sepal.Width`

from iris group by Species')

})

## user system elapsed

## 0 0 0


R Programming

Conditional statements

Conditional statements take one of the following three forms.

if ( cond ) expr

if ( cond ) expr1 else expr2

if ( cond1 ) expr1

else if( cond2 ) expr2

else expr3


Conditional statements (2)

Compute the median of Sepal.Length in the iris data, then label each Sepal.Length value "S" if it is smaller than the median and "L" otherwise.

data(iris)
n <- length(iris$Sepal.Length)
Sepal.Length.Cat <- character(n)
Med <- median(iris$Sepal.Length)
for (i in 1:n) {
  if (iris$Sepal.Length[i] < Med) {
    Sepal.Length.Cat[i] <- "S"
  } else {
    Sepal.Length.Cat[i] <- "L"
  }
}
Sepal.Length.Cat[1:10]

## [1] "S" "S" "S" "S" "S" "S" "S" "S" "S" "S"
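
The same labeling can be done without an explicit loop. This is a sketch of an alternative, not the slide's own solution; Sepal.Length.Cat2 is a name introduced here to keep it separate from the loop result.

```r
Med <- median(iris$Sepal.Length)

# ifelse() applies the condition to every element at once
Sepal.Length.Cat2 <- ifelse(iris$Sepal.Length < Med, "S", "L")

Sepal.Length.Cat2[1:10]
table(Sepal.Length.Cat2)   # how many observations fall in each category
```

if () ... else works on a single condition, while ifelse() is vectorized, which is why the loop is needed in one case and not in the other.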


Loops

Loops are written in one of the following three forms.

while ( cond ) expr

repeat expr

for ( var in list ) expr

break: exits a while, repeat, or for loop

next: skips the remaining statements and moves on to the next iteration
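
A small sketch of while, repeat, break, and next working together; the counters i, k, and total are invented for the illustration.

```r
# while: repeat as long as the condition holds
i <- 0
while (i < 3) {
  i <- i + 1
}

# repeat: loops forever unless break is called
total <- 0
k <- 0
repeat {
  k <- k + 1
  if (k %% 2 == 0) next   # skip even numbers
  if (k > 9) break        # stop once k has passed 9
  total <- total + k      # adds 1, 3, 5, 7, 9
}
total
```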


Loops (2)

Generate 10 random numbers from the standard normal distribution and sum the positive ones

x <- rnorm(10)
sum.positive <- 0
for (i in 1:length(x)) {
  if (x[i] > 0) sum.positive <- sum.positive + x[i]
}
sum.positive

## [1] 3.869031
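
The loop can be replaced by logical indexing. The sketch below fixes a seed (an arbitrary choice, made only so the run is reproducible) and checks that both approaches agree; sum.positive.vec is a name introduced here.

```r
set.seed(1)               # arbitrary seed, only to make the sketch reproducible
x <- rnorm(10)

# Loop version, as in the slide
sum.positive <- 0
for (i in seq_along(x)) {
  if (x[i] > 0) sum.positive <- sum.positive + x[i]
}

# Vectorized version: select the positive elements, then sum them
sum.positive.vec <- sum(x[x > 0])

all.equal(sum.positive, sum.positive.vec)
```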


Writing functions

How to write a function

function_name <- function(arg_1, arg_2, ...){

expression...;

return(...)

}

(A function that converts miles to km, using the approximation 1 mile ≈ 8/5 km)

miles.to.km <- function(miles) miles*8/5

miles.to.km(175) # Approximate distance to Sydney, in miles

## [1] 280

- To convert 100, 200, and 300 miles to kilometers:

miles.to.km(c(100,200,300))

## [1] 160 320 480
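
The template with an explicit return() and a default argument can be sketched as follows. The function miles.to.km2 and its factor argument are hypothetical additions for this illustration; 1.609344 is the exact number of kilometers in one mile.

```r
# Hypothetical variant of miles.to.km with an adjustable conversion factor
miles.to.km2 <- function(miles, factor = 1.609344) {
  if (!is.numeric(miles)) stop("miles must be numeric")
  km <- miles * factor
  return(km)
}

miles.to.km2(100)                 # uses the default factor
miles.to.km2(100, factor = 8/5)   # reproduces the slide's approximation
```

Because the arithmetic is vectorized, this version also accepts a vector such as c(100, 200, 300) without any change.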


R Graphics

Basic plotting functions

High-level functions: plot(), barplot(), boxplot(), hist(), pie(), persp()

Low-level functions

Drawing points: points()
Drawing lines: lines(), abline(), arrows()
Printing text: text()
Shapes: rect(), polygon()
Axes: axis()
Grid: grid()
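
A high-level function starts a new plot, and low-level functions then add to it. The following sketch uses made-up data and draws to a null PDF device so that it runs without opening a window; remove the pdf(NULL) and dev.off() lines to see the result on screen.

```r
pdf(NULL)                      # null device: nothing is written to disk
x <- seq(0, 10, by = 0.5)
y <- 0.3 + 2 * x

plot(x, y, main = "High- and low-level functions")  # high-level: creates the plot
abline(a = 0.3, b = 2, lty = 2)                     # low-level: adds a reference line
points(5, 10.3, pch = 16)                           # low-level: marks one point
text(5, 10.3, "x = 5", pos = 4)                     # low-level: labels that point
grid()                                              # low-level: adds a background grid
dev.off()
```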


plot()

x <- rnorm(100, sd=2); y <- 0.3 + 2*x + rnorm(100, sd=1)

plot(x)

Figure : index plot of x (horizontal axis: Index, vertical axis: x)


bar plot

#par(mai=c(2,1,0.5,0.5))

pie.sales <- c(0.12, 0.3, 0.26, 0.16, 0.04, 0.12)

names(pie.sales) <- c("Blueberry", "Cherry", "Apple", "Boston Cream",
                      "Other", "Vanilla Cream")

barplot(pie.sales, las=2) # las=2: axis labels are drawn perpendicular to the axis

Figure : bar plot of pie.sales, one bar per category (Blueberry, Cherry, Apple, Boston Cream, Other, Vanilla Cream)


bar plot (2)

counts <- table(mtcars$vs, mtcars$gear)

#par(cex=1.5)

barplot(counts, main="Car Distribution by Gears and VS",
        xlab="Number of Gears", col=c("darkblue","red"),
        legend = rownames(counts), beside=TRUE)

Figure : grouped bar plot "Car Distribution by Gears and VS" (horizontal axis: Number of Gears, with groups 3, 4, 5; legend: 0, 1)


bar plot (3)

#par(cex=1.5)

barplot(counts, main="Car Distribution by Gears and VS",
        xlab="Number of Gears", col=c("darkblue","red"),
        legend = rownames(counts))

Figure : stacked bar plot "Car Distribution by Gears and VS" (horizontal axis: Number of Gears, with bars 3, 4, 5; legend: 0, 1)


pie plot

pie(pie.sales)

Figure : pie chart of pie.sales (Blueberry, Cherry, Apple, Boston Cream, Other, Vanilla Cream)


mtcars data

head(mtcars)

## mpg cyl disp hp drat wt qsec vs am gear carb

## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4

## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4

## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1

## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1

## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2

## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1


Dot Chart

dotchart(mtcars$mpg, labels=row.names(mtcars),
         cex=0.7,
         main="Gas Mileage \nfor Car Models",
         xlab="Miles Per Gallon")

Figure : dot chart "Gas Mileage for Car Models", one row per car model in mtcars (horizontal axis: Miles Per Gallon)


Dot Chart (2)

idx <- order(mtcars$mpg)

dotchart(mtcars$mpg[idx], labels=row.names(mtcars)[idx], cex=1,
         main="Gas Mileage \nfor Car Models", xlab="Miles Per Gallon")

Figure : dot chart "Gas Mileage for Car Models" with car models sorted by mpg (horizontal axis: Miles Per Gallon)


par(): graphical parameters

x<-rnorm(100)

par(mfrow=c(1,2))

hist(x)

plot(x)

Figure : two panels side by side: "Histogram of x" (vertical axis: Frequency) on the left and the index plot of x (horizontal axis: Index) on the right


Summarizing data with graphs

The cars data contain the speed of cars (speed) and the distance taken to stop (dist)

data(cars)

head(cars, 3)

## speed dist

## 1 4 2

## 2 4 10

## 3 7 4

tail(cars, 3)

## speed dist

## 48 24 93

## 49 24 120

## 50 25 85


Histogram

hist(cars$speed, nclass=8, main="Histogram", xlab="speed")

Figure : Histogram (horizontal axis: speed, vertical axis: Frequency)


Box plot

boxplot(Sepal.Length~Species, data=iris, main="Box plot")

Figure : Box plot of Sepal.Length by species (setosa, versicolor, virginica)


scatter plot

plot(cars)

Figure : Scatter plot (horizontal axis: speed, vertical axis: dist)


scatter plot (2)

data(iris)

plot(Petal.Length ~ Sepal.Length, data=iris, bty="l",pch=20)

abline(a=0,b=1,lty=2,lwd=2)

abline(lm(Petal.Length ~ Sepal.Length, data=iris),lty=1,lwd=2)

Figure : Sepal.Length vs. Petal.Length


scatter plot with the pairs() function

pairs(iris[,1:4], main = "Fisher's Iris Data",
      pch = 21, bg = c("red","green3","blue")[unclass(iris$Species)])

Figure : Using the pairs() function (scatter plot matrix "Fisher's Iris Data" of Sepal.Length, Sepal.Width, Petal.Length, Petal.Width)


Normal probability plot (QQ plot)

qqnorm(cars$speed)

qqline(cars$speed)

Figure : Normal Q-Q Plot (horizontal axis: Theoretical Quantiles, vertical axis: Sample Quantiles)


3D plot (trees data)

A data frame with 31 observations on 3 variables.

[,1] Girth numeric Tree diameter in inches

[,2] Height numeric Height in ft

[,3] Volume numeric Volume of timber in cubic ft

head(trees)

## Girth Height Volume

## 1 8.3 70 10.3

## 2 8.6 65 10.3

## 3 8.8 63 10.2

## 4 10.5 72 16.4

## 5 10.7 81 18.8

## 6 10.8 83 19.7


3D plot (2)

require(scatterplot3d)

scatterplot3d(trees, type="h", highlight.3d=TRUE,
              angle=55, scale.y=0.7, pch=16, main="3 dimensional plot for trees data")

Figure : 3D scatter plot of the trees data (axes: Girth, Height, Volume)


3D pie chart for categorical data

slices <- c(18, 12, 4, 16, 8, 9, 12)

lbels <- c("US", "UK", "Australia", "Germany", "Canada", "India", "Korea")

library(plotrix)

pie3D(slices,labels=lbels,explode=0.1, main="3D Pie Chart", mar=c(4,0,3,0))

Figure : "3D Pie Chart" with slices US, UK, Australia, Germany, Canada, India, Korea

