intro to r vietnamese

Upload: tuyen-nguyen

Post on 11-Oct-2015


DESCRIPTION

An introduction to R and a guide to using it

TRANSCRIPT

    Phân tích số liệu và biểu đồ bằng R
    (Data Analysis and Graphics with R)

    Nguyễn Văn Tuấn
    Garvan Institute of Medical Research
    Sydney, Australia

    Contents

    1 Downloading and installing R
    2 Downloading and installing R packages
    3 R "grammar" (syntax)
        3.1 Naming in R
        3.2 Getting help in R
    4 Entering data into R
        4.1 Entering data directly: c()
        4.2 Entering data directly: edit(data.frame())
        4.3 Reading data from a text file: read.table
        4.4 Reading data from Excel
        4.5 Reading data from SPSS
        4.6 Information about the data
        4.7 Generating sequences with seq, rep and gl
    5 Data editing
        5.1 Splitting data: subset
        5.2 Extracting data from a data.frame
        5.3 Merging two data.frames into one: merge
        5.4 Recoding data (data coding)
        5.5 Recoding data with replace
        5.6 Converting to a factor
        5.7 Grouping data with cut2 (Hmisc)
    6 Using R for simple calculations
        6.1 Simple calculations
        6.2 Using R for matrix calculations
    7 Using R for probability calculations
        7.1 Permutations
        7.2 Combinations
        7.3 Random variables and distribution functions
            7.3.1 Binomial distribution
            7.3.2 Poisson distribution
            7.3.3 Normal distribution
            7.3.4 Standardized normal distribution
        7.4 Random sampling
    8 Graphics
        8.1 Data for graphical analysis
        8.2 Graphs for one discrete variable: barplot
        8.3 Graphs for two discrete variables: barplot
        8.4 Pie charts
        8.5 Graphs for one continuous variable: stripchart and hist
            8.5.1 Stripchart
            8.5.2 Histogram
        8.6 Boxplots
        8.7 Graphical analysis of two continuous variables
            8.7.1 Scatter plots
        8.8 Graphical analysis of several variables: pairs
        8.9 Graphs with standard errors
    9 Descriptive statistical analysis
        9.1 Descriptive statistics (summary)
        9.2 Descriptive statistics by group
        9.3 The t test (t.test)
            9.3.1 One-sample t test
            9.3.2 Two-sample t test
        9.4 Wilcoxon test for two samples (wilcox.test)
        9.5 t test for paired data (paired t-test, t.test)
        9.6 Wilcoxon test for paired data (wilcox.test)
        9.7 Frequencies
        9.8 Testing a proportion (prop.test, binom.test)
        9.9 Comparing two proportions (prop.test, binom.test)
        9.10 Comparing several proportions (prop.test, chisq.test)
            9.10.1 Chi-squared test (chisq.test)
            9.10.2 Fisher's exact test (fisher.test)
    10 Linear regression analysis
        10.1 Correlation coefficients
            10.1.1 Pearson correlation coefficient
            10.1.2 Spearman correlation coefficient
            10.1.3 Kendall correlation coefficient
        10.2 The simple linear regression model
        10.3 The multiple linear regression model
    11 Analysis of variance
        11.1 One-way analysis of variance
        11.2 Multiple comparisons and adjustment of p values
        11.3 Non-parametric analysis
        11.4 Two-way analysis of variance (two-way ANOVA)
    12 Logistic regression analysis
        12.1 The logistic regression model
        12.2 Logistic regression analysis with R
        12.3 Estimating probabilities with R
    13 Sample size estimation
        13.1 The concept of power
        13.2 Data needed for sample size estimation
        13.4 Sample size estimation
            13.4.1 Sample size for one mean
            13.4.2 Sample size for comparing two means
            13.4.3 Sample size for analysis of variance
            13.4.4 Sample size for estimating a proportion
            13.4.5 Sample size for comparing two proportions
    14 References
    15 Glossary of terms used in the book

    Introduction to R

    Data analysis and graphics are usually carried out with widely used software such as SAS, SPSS, Stata, Statistica and S-Plus. These packages were developed by software companies and brought to market over roughly the past three decades, and they are used for teaching and research by universities, research centres and industry around the world. But because they are relatively expensive to use (sometimes up to hundreds of thousands of dollars a year), some universities in developing countries (and even in some developed countries) cannot afford to use them over the long term. For this reason, statistical researchers around the world collaborated to develop new software on an open-source basis, so that everyone working in statistics and mathematics can use it in a uniform way and entirely free of charge.

    In 1996, in an important paper on statistical computing, two statisticians, Ross Ihaka and Robert Gentleman, then at the University of Auckland, New Zealand, outlined a new language for statistical analysis that they named R [1]. The initiative was endorsed by many statisticians around the world, who joined in developing R.

    At the time of writing, after less than 10 years of development, more and more statisticians, mathematicians and researchers in every field have switched to R for analysing scientific data. Worldwide there is a network of more than one million R users, and the number is growing rapidly. It is fair to say that within the next 10 years commercial statistical software will no longer play as large a role as it has until now.

    So what is R? Briefly, R is software for statistical analysis and graphics. In essence, though, R is a general-purpose computer language that can be used for many purposes, from simple calculation and recreational mathematics to matrix computation and complex statistical analyses. Because it is a language, R can also be used to build specialised software for a particular computational problem.

    For these reasons, those who do scientific research, especially in countries that are still poor such as ours, need to learn to use R for statistical analysis and graphics. This short text guides the reader through using R. I assume the reader knows nothing about R, but I do expect some familiarity with using a computer.

    1. Downloading and installing R

    To use R, the first step is to install it on your own computer. To do this, go online to the website of the Comprehensive R Archive Network (CRAN):

    http://cran.R-project.org.

    The file to download depends on the version, but its name usually begins with the letter R followed by the version number. For example, the version I was using at the end of 2005 was 2.2.1, so the file to download was:

    R-2.2.1-win32.exe

    This file is about 26 MB, and the address from which it could be downloaded is:

    http://cran.r-project.org/bin/windows/base/R-2.2.1-win32.exe

    The same website offers many documents explaining how to use R, at every level from elementary to advanced. If you are not yet comfortable with English, this document provides the information needed to use R without having to read the others.

    Once R has been downloaded, the next step is to install (set up) it. Simply click on the downloaded file and follow the installation instructions on the screen. This is a very simple step; installation takes about one minute.

    After installation, an icon named R 2.2.1 appears on the desktop. At this point we are ready to use R. Clicking the icon opens the R window.

    [Figure: the R console window]

    2. Downloading and installing R packages

    R provides a computer language and a number of functions for basic, simple analyses. For more complex analyses we need to download additional packages. A package is a small piece of software, developed by statisticians to solve a specific problem, that runs inside R. For linear regression, for example, R has the function lm, but for deeper and more complex analyses we need packages such as lme4. These packages have to be downloaded and installed.

    The address for downloading packages is again http://cran.r-project.org; click on Packages, which appears on the left of the page menu. In my view, the packages worth installing for epidemiological analyses are:

    Package     Purpose
    trellis     Drawing graphs and making them look better
    lattice     Drawing graphs and making them look better
    Hmisc       A number of data modelling methods by F. Harrell
    Design      A number of study design models by F. Harrell
    Epi         Epidemiological analyses
    epitools    Another package for epidemiological analyses
    foreign     Importing data from other software such as SPSS, Stata, SAS, etc.
    rmeta       Meta-analysis
    meta        Another package for meta-analysis
    survival    Analyses based on the Cox model (Cox's proportional hazards model)
    Zelig       Statistical analyses in the social sciences
    genetics    Analysis of genetic data
    BMA         Bayesian Model Averaging

    These packages can be installed online by choosing Install packages in the Packages menu of R. Alternatively, if a package has already been downloaded to the computer, it can be installed more quickly by choosing Install package(s) from local zip file, also in the Packages menu.

    [Figure: the Packages menu in R]
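    The menu steps above can also be done from the R console. The following is only a sketch: install.packages(), library() and requireNamespace() are standard R functions, but the helper name ensure_package is made up for this illustration, and Hmisc is used only because the text lists it.

```r
# Hypothetical helper (not from the original text): install a package
# only if it is not already available, then report whether it can be loaded.
ensure_package <- function(pkg) {
  # requireNamespace() returns TRUE when the package is already installed
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)   # needs an internet connection and a CRAN mirror
  }
  requireNamespace(pkg, quietly = TRUE)
}

# "stats" ships with R, so nothing is downloaded here
ensure_package("stats")

# A contributed package would be handled the same way and then attached:
# ensure_package("Hmisc")
# library(Hmisc)
```

    Wrapping the installation in a check keeps a script from downloading a package again every time it is run.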

    3. R "grammar" (syntax)

    R is an interactive language: when we issue a command, and the command is grammatically correct, R answers with a result, and the interaction continues until we have what we need. The basic unit of R grammar is a command or function (the two terms are used interchangeably here). A function takes arguments, so the function name is followed by the arguments we must supply. The general syntax of R is:

    > object <- function(argument 1, argument 2, ..., argument n)

    For example, in

    > reg <- lm(y ~ x)

    reg is an object, lm is a function, and y ~ x is the argument of the function. Likewise, in

    > setwd("c:/works/stats")

    setwd is a function and "c:/works/stats" is its argument.

    To find out which arguments a function takes, use args(x) (args is short for arguments), where x is the function of interest:

    > args(lm)
    function (formula, data, subset, weights, na.action, method = "qr",
        model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE,
        contrasts = NULL, offset, ...)
    NULL

    R is an object-oriented language, which means that data in R are held in objects. This orientation also affects the way R is written. For example, to state the condition "x equals 5" we do not write x = 5 as we ordinarily would; R requires x == 5.

    In R, x = 5 is an assignment, equivalent to x <- 5.
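    The difference between assignment and comparison can be tried directly. This is a minimal sketch; the object names x and y are arbitrary.

```r
x <- 5          # assignment with the arrow operator
y = 5           # "=" at the top level is also an assignment

same <- (x == y)   # "==" compares the two values and returns a logical
same               # TRUE

x == 6          # FALSE: a comparison never changes x
x               # still 5
```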

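    In R, everything that follows the # symbol on a line has no effect; # is reserved for notes added by the user. For example:

    > # the following command simulates 10 normal values
    > x <- rnorm(10)

    3.1 Naming in R

    An object name must be written as one unbroken word: R accepts myobject but not my object. Because a name such as myobject can be hard to read, its parts can be separated with a dot, as in my.object. R also distinguishes upper and lower case, so My.object is not the same as my.object. For example, if My.object.u and my.object.L are two objects whose values add up to 20:

    > My.object.u + my.object.L
    [1] 20

    A few points to keep in mind when naming things in R:

    Do not use the _ (underscore) symbol in the name of a variable, as in my_object, and do not use a hyphen, as in my-object (R reads the hyphen as a minus sign).

    Do not give an object the same name as a variable inside a dataset. For example, if we have a data.frame (dataset) containing a variable age, we should not also create a separate object called age.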
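    The naming rules can be checked at the console. The sketch below uses made-up object names; make.names() is a base R function that turns an arbitrary string into a syntactically valid name.

```r
my.object <- 10     # a dot is a common separator in R names
My.object <- 99     # a different object: R is case-sensitive

differ <- (my.object != My.object)   # TRUE

# make.names() shows how R would repair names that break the rules:
# the space and the hyphen are both replaced by a dot
fixed <- make.names(c("my object", "my-object"))
fixed
```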

    3.2 Getting help in R

    Besides args(), R provides help() so that users can learn the syntax of each function. For example, to find out which arguments the function lm takes, simply type:

    > help(lm)

    or

    > ?lm

    A window appears on the right of the screen explaining how the function is used, often with examples. You can simply copy and paste the examples into R to see how they work.

    Before using R, in addition to this book, you may want to read the guide built into R by choosing the Help menu and then Html help. Here too, commands can be copied and pasted into R to see how R behaves.

    [Figure: the Help menu in R]

    4. Entering data into R

    To analyse data with R, the data must be in a form that R can understand and process, namely a data.frame. There are many ways of getting data into a data.frame, from typing them in directly to importing them from various sources. The most common ones follow.

    4.1 Entering data directly: c()

    Example 1: we have age and insulin data for 10 patients, as follows, and want to enter them into R.

    age  insulin
    50   16.5
    62   10.8
    60   32.3
    40   19.3
    48   14.2
    47   11.3
    57   15.5
    70   15.8
    48   16.2
    67   11.2

    We can use the function named c as follows:

    > age <- c(50, 62, 60, 40, 48, 47, 57, 70, 48, 67)
    > insulin <- c(16.5, 10.8, 32.3, 19.3, 14.2, 11.3, 15.5, 15.8, 16.2, 11.2)

    The first command tells R that we want to create a column of data (which from now on I will call a variable) named age, and the second creates another column named insulin. Any other names could of course be used.

    We use the function c (short for concatenation, i.e. joining together) to enter the data. Note that the values for the individual patients are separated by commas.

    The notation insulin <- means that the values that follow are stored in the variable insulin. To combine the two variables into a single data.frame named tuan and display it, we type:

    > tuan <- data.frame(age, insulin)
    > tuan

    and R reports:

       age insulin
    1   50    16.5
    2   62    10.8
    3   60    32.3
    4   40    19.3
    5   48    14.2
    6   47    11.3
    7   57    15.5
    8   70    15.8
    9   48    16.2
    10  67    11.2
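    The steps of Example 1 can be sketched in one piece. The data are those given in the text; the length check and the summary calls at the end are additions for illustration only.

```r
# Age and insulin values for the 10 patients of Example 1
age     <- c(50, 62, 60, 40, 48, 47, 57, 70, 48, 67)
insulin <- c(16.5, 10.8, 32.3, 19.3, 14.2, 11.3, 15.5, 15.8, 16.2, 11.2)

# Both vectors must have the same length before they are combined
stopifnot(length(age) == length(insulin))

# Combine the two variables into one data.frame
tuan <- data.frame(age, insulin)

str(tuan)            # structure: 10 observations of 2 variables
mean(tuan$age)       # a column is reached with the $ operator
mean(tuan$insulin)
```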

    To keep these data in a file in R format, we use the save command. Suppose we want to store the data in the directory c:\works\insulin; we type:

    > setwd("c:/works/insulin")
    > save(tuan, file="tuan.rda")

    The first command (setwd, where wd stands for working directory) tells R that we want to store the data in the directory c:\works\insulin. Note that Windows normally uses the backslash \, but in R we use the forward slash /.

    The second command (save) tells R that the data in the object tuan are to be stored in a file named tuan.rda. After these two commands, a file named tuan.rda is present in that directory.

    4.2 Entering data directly: edit(data.frame())

    Example 1 (continued): we can also enter the age and insulin data for the 10 patients with a very useful function, edit(data.frame()). With this function R opens a new window with rows and columns, much like Excel, into which the data can be typed. For example:

    > ins <- edit(data.frame())

    We then type the values into each column. When finished, click the X (close) button in the right-hand corner of the spreadsheet; we then have a data.frame named ins with the two variables age and insulin.

    4.3 Reading data from a text file: read.table

    Example 2: We collected data on age and cholesterol from a study of 50 patients with hypertension. The data are stored in a text file named chol.txt in the directory c:\works\insulin. The layout is as follows: column 1 is the patient ID, column 2 is sex (Nam = male, Nu = female), column 3 is age, column 4 is body mass index (bmi), column 5 is HDL cholesterol (hdl), followed by LDL cholesterol (ldl), total cholesterol (tc) and triglycerides (tg).

    id  sex  age  bmi  hdl    ldl  tc   tg
    1   Nam  57   17   5.000  2.0  4.0  1.1
    2   Nu   64   18   4.380  3.0  3.5  2.1
    3   Nu   60   18   3.360  3.0  4.7  0.8
    4   Nam  65   18   5.920  4.0  7.7  1.1
    5   Nam  47   18   6.250  2.1  5.0  2.1
    6   Nu   65   18   4.150  3.0  4.2  1.5
    7   Nam  76   19   0.737  3.0  5.9  2.6
    8   Nam  61   19   7.170  3.0  6.1  1.5
    9   Nam  59   19   6.942  3.0  5.9  5.4
    10  Nu   57   19   5.000  2.0  4.0  1.9
    ...
    46  Nu   52   24   3.360  2.0  3.7  1.2
    47  Nam  64   24   7.170  1.0  6.1  1.9
    48  Nam  45   24   7.880  4.0  6.7  3.3
    49  Nu   64   25   7.360  4.6  8.1  4.0
    50  Nu   62   25   7.750  4.0  6.2  2.5

    We want to read these data into R for later analysis, using read.table as follows:

    > setwd("c:/works/insulin")
    > chol <- read.table("chol.txt", header=TRUE)

    Here header=TRUE asks R to treat the first line of the file as the column names. To check that R has read the data, type:

    > chol

    or

    > names(chol)

    R then lists the columns in the data (names asks which columns the data contain and what they are called):

    [1] "id"  "sex" "age" "bmi" "hdl" "ldl" "tc"  "tg"

    We can now save the data in R format for later processing:

    > save(chol, file="chol.rda")

    4.4 Reading data from Excel: read.csv

    To read data from Excel, two steps are needed:

    Step 1: Use Save as in Excel to save the data in csv format;
    Step 2: Use R (the read.csv command) to read the csv data.

    Example 3: A dataset with the following columns is stored in Excel, and we want to move it into R for analysis. The file is named excel.xls.

    ID  Age  Sex  Ethnicity  IGFI    IGFBP3  ALS     PINP    ICTP  P3NP
    1   18   1    1          148.27  5.14    316.00  61.84   5.81  4.21
    2   28   1    1          114.50  5.23    296.42  98.64   4.96  5.33
    3   20   1    1          109.82  4.33    269.82  93.26   7.74  4.56
    4   21   1    1          112.13  4.38    247.96  101.59  6.66  4.61
    5   28   1    1          102.86  4.04    240.04  58.77   4.62  4.95
    6   23   1    4          129.59  4.16    266.95  48.93   5.32  3.82
    7   20   1    1          142.50  3.85    300.86  135.62  8.78  6.75
    8   20   1    1          118.69  3.44    277.46  79.51   7.19  5.11
    9   20   1    1          197.69  4.12    335.23  57.25   6.21  4.44
    10  20   1    1          163.69  3.96    306.83  74.03   4.95  4.84
    11  22   1    1          144.81  3.63    295.46  68.26   4.54  3.70
    12  27   0    2          141.60  3.48    231.20  56.78   4.47  4.07
    13  26   1    1          161.80  4.10    244.80  75.75   6.27  5.26
    14  33   1    1          89.20   2.82    177.20  48.57   3.58  3.68
    15  34   1    3          161.80  3.80    243.60  50.68   3.52  3.35
    16  32   1    1          148.50  3.72    234.80  83.98   4.85  3.80
    17  28   1    1          157.70  3.98    224.80  60.42   4.89  4.09
    18  18   0    2          222.90  3.98    281.40  74.17   6.43  5.84
    19  26   0    2          186.70  4.64    340.80  38.05   5.12  5.77
    20  27   1    2          167.56  3.56    321.12  30.18   4.78  6.12

    The first thing to do, as noted above, is to save the data from Excel in csv format:

    In Excel, choose File, then Save as
    Under Save as type, choose CSV (Comma delimited)

    When this is done we have a file named excel.csv in the directory c:\works\insulin. The second step is to go to R and issue the following commands:

    > setwd("c:/works/insulin")
    > gh <- read.csv("excel.csv", header=TRUE)
    > save(gh, file="gh.rda")

    4.5 Reading data from SPSS: read.spss

    The statistical package SPSS stores data in sav format. If, for example, we have a dataset named testo.sav in the directory c:\works\insulin and want to convert it into a form R understands, we use the read.spss command in the package named foreign. The following commands do this easily.

    First, load foreign with the library command:

    > library(foreign)

    Second, call read.spss:

    > setwd("c:/works/insulin")
    > testo <- read.spss("testo.sav", to.data.frame=TRUE)
    > save(testo, file="testo.rda")

    4.6 Information about the data

    Suppose we have read data into a data.frame named chol as in Example 2. To find out what the dataset contains, we can proceed as follows.

    Tell R that we want to work with chol by using attach(arg), where arg is the name of the dataset:

    > attach(chol)

    Check whether chol is a data.frame with is.data.frame(arg), where arg is the name of the dataset:

    > is.data.frame(chol)
    [1] TRUE

    R confirms that chol is indeed a data.frame.

    How many columns (variables) and rows (observations) does the dataset have? Use dim(arg), where arg is the name of the dataset (dim is short for dimension). For example (R's result is shown immediately after the command):

    > dim(chol)
    [1] 50 8

    So we have 50 rows and 8 columns (variables). What are these variables called? Use names(arg), where arg is the name of the dataset:

    > names(chol)
    [1] "id"  "sex" "age" "bmi" "hdl" "ldl" "tc"  "tg"
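    The inspection commands above can be tried without the file chol.txt. In this sketch the first five rows of the chol data shown in the text are read from a character string through the text= argument of read.table; the object name chol5 is made up for the illustration.

```r
# First five rows of the chol data, typed in as text instead of read from disk
txt <- "id sex age bmi hdl ldl tc tg
1 Nam 57 17 5.000 2.0 4.0 1.1
2 Nu 64 18 4.380 3.0 3.5 2.1
3 Nu 60 18 3.360 3.0 4.7 0.8
4 Nam 65 18 5.920 4.0 7.7 1.1
5 Nam 47 18 6.250 2.1 5.0 2.1"

# header = TRUE: the first line holds the column names
chol5 <- read.table(text = txt, header = TRUE)

is.data.frame(chol5)     # TRUE
dim(chol5)               # 5 rows, 8 columns
names(chol5)             # column names
sex.counts <- table(chol5$sex)   # counts per category, without attach()
sex.counts
```

    Referring to a column as chol5$sex avoids attach(), which can be convenient when several datasets share variable names.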

    How many men and women are there in the variable sex? To answer this we can use table(arg), where arg is the name of the variable. For example:

    > table(sex)
    sex
    nam Nam  Nu
      1  21  28

    The result shows 21 men (Nam) and 28 women (Nu). One further record is coded with a lower-case nam and is therefore counted as a separate category, because R is case-sensitive.

    4.7 Generating sequences with seq, rep and gl

    R can also generate sequences of numbers, which is convenient for simulation and for experimental design. The usual functions are seq (sequence), rep (repetition) and gl (generate levels).

    Using seq

    Create a vector of the numbers from 1 to 12:

    > x <- (1:12)
    > x
    [1] 1 2 3 4 5 6 7 8 9 10 11 12
    > seq(12)
    [1] 1 2 3 4 5 6 7 8 9 10 11 12

    Create a vector of the numbers from 12 down to 5:

    > x <- (12:5)
    > x
    [1] 12 11 10 9 8 7 6 5
    > seq(12,7)
    [1] 12 11 10 9 8 7

    The general form of seq is seq(from, to, by= ) or seq(from, to, length.out= ). Its use is illustrated by the following examples.

    Create a vector of numbers from 4 to 6 in steps of 0.25:

    > seq(4, 6, 0.25)
    [1] 4.00 4.25 4.50 4.75 5.00 5.25 5.50 5.75 6.00

    Create a vector of 10 numbers, the smallest being 2 and the largest 15:

    > seq(length=10, from=2, to=15)
    [1] 2.000000 3.444444 4.888889 6.333333 7.777778 9.222222 10.666667 12.111111 13.555556 15.000000

    Using rep

    The form of rep is rep(x, times, ...), where x is a value or vector and times is the number of repetitions. For example:

    Repeat the number 10 three times:

    > rep(10, 3)
    [1] 10 10 10

    Repeat the numbers 1 to 4 three times:

    > rep(c(1:4), 3)
    [1] 1 2 3 4 1 2 3 4 1 2 3 4

    Repeat the numbers 1.2, 2.7, 4.8 five times:

    > rep(c(1.2, 2.7, 4.8), 5)
    [1] 1.2 2.7 4.8 1.2 2.7 4.8 1.2 2.7 4.8 1.2 2.7 4.8 1.2 2.7 4.8

    Using gl

    gl is used to create a categorical variable, i.e. a variable whose values are counted rather than used in arithmetic. The general form is gl(n, k, length = n*k, labels = 1:n, ordered = FALSE), and its use is illustrated by the following examples.

    Create a variable with levels 1 and 2, each repeated 8 times:

    > gl(2, 8)
    [1] 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2
    Levels: 1 2

    Or a variable with levels 1, 2 and 3, each repeated 5 times:

    > gl(3, 5)
    [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
    Levels: 1 2 3

    Create a variable with levels 1 and 2, each repeated 10 times (hence length=20):

    > gl(2, 10, length=20)
    [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2
    Levels: 1 2

    Or, with blocks of 2 repeated until the length reaches 20:

    > gl(2, 2, length=20)
    [1] 1 1 2 2 1 1 2 2 1 1 2 2 1 1 2 2 1 1 2 2
    Levels: 1 2

    Adding labels:

    > gl(2, 5, label=c("C", "T"))
    [1] C C C C C T T T T T
    Levels: C T

    Create a variable with the 4 levels 1, 2, 3, 4, each repeated twice:

    > rep(1:4, c(2,2,2,2))
    [1] 1 1 2 2 3 3 4 4

    This is equivalent to:

    > rep(1:4, each = 2)
    [1] 1 1 2 2 3 3 4 4

    With dates and times:

    > x <- .leap.seconds[1:3]
    > rep(x, 2)
    [1] "1972-06-30 17:00:00 Pacific Standard Time" "1972-12-31 16:00:00 Pacific Standard Time"
    [3] "1973-12-31 16:00:00 Pacific Standard Time" "1972-06-30 17:00:00 Pacific Standard Time"
    [5] "1972-12-31 16:00:00 Pacific Standard Time" "1973-12-31 16:00:00 Pacific Standard Time"
    > rep(as.POSIXlt(x), rep(2, 3))
    [1] "1972-06-30 17:00:00 Pacific Standard Time" "1972-06-30 17:00:00 Pacific Standard Time"
    [3] "1972-12-31 16:00:00 Pacific Standard Time" "1972-12-31 16:00:00 Pacific Standard Time"
    [5] "1973-12-31 16:00:00 Pacific Standard Time" "1973-12-31 16:00:00 Pacific Standard Time"

    5. Data editing

    5.1 Splitting data: subset

    We return to the chol data of Example 2. To make the discussion easier to follow, recall that the data were read into an R dataset named chol from a text file named chol.txt:

    > setwd("c:/works/insulin")
    > chol <- read.table("chol.txt", header=TRUE)
    > attach(chol)

    If for some reason we want to analyse men separately, we can split chol into two data.frames, say nam and nu. To do this we use subset(data, cond), where data is the data.frame to be split and cond is the condition. For example:

    > nam <- subset(chol, sex=="Nam")
    > nu <- subset(chol, sex=="Nu")

    After these two commands we have two new datasets (two data.frames) named nam and nu. Note that in the conditions sex == "Nam" and sex == "Nu" we use == rather than = to express exact equality. We can of course split the data into other data.frames using conditions on other variables. For example, the following command creates a new data.frame named old containing the patients aged 60 or over:

    > old <- subset(chol, age>=60)
    > dim(old)
    [1] 25 8

    Or a new data.frame containing the male patients aged 60 or over:

    > n60 <- subset(chol, age>=60 & sex=="Nam")
    > dim(n60)
    [1] 9 8

    5.2 Extracting data from a data.frame

    chol has 8 variables. We can extract from chol only the variables we need, such as the ID (id), age (age) and total cholesterol (tc). From names(chol) we know that id is column 1, age is column 3 and tc is column 7, so we can use:

    > data2 <- chol[, c(1,3,7)]

    The comma before c(1,3,7) means that all rows are kept. To keep only the first 10 rows, here of columns 1, 2 and 7 (id, sex and tc):

    > data3 <- chol[1:10, c(1,2,7)]
    > print(data3)
       id sex  tc
    1   1 Nam 4.0
    2   2  Nu 3.5
    3   3  Nu 4.7
    4   4 Nam 7.7
    5   5 Nam 5.0
    6   6  Nu 4.2
    7   7 Nam 5.9
    8   8 Nam 6.1
    9   9 Nam 5.9
    10 10  Nu 4.0

    Note that print(arg) simply lists all the data in the data.frame arg. In fact, just typing data3 gives exactly the same result as print(data3).

    5.3 Merging two data.frames into one: merge

    Suppose we have data held in two data.frames. The first, named d1, has 3 columns, id, sex and tc:

    id  sex  tc
    1   Nam  4.0
    2   Nu   3.5
    3   Nu   4.7
    4   Nam  7.7
    5   Nam  5.0
    6   Nu   4.2
    7   Nam  5.9
    8   Nam  6.1
    9   Nam  5.9
    10  Nu   4.0

    The second, named d2, has 3 columns, id, sex and tg:

    id  sex  tg
    1   Nam  1.1
    2   Nu   2.1
    3   Nu   0.8
    4   Nam  1.1
    5   Nam  2.1
    6   Nu   1.5
    7   Nam  2.6
    8   Nam  1.5
    9   Nam  5.4
    10  Nu   1.9
    11  Nu   1.7

    The two datasets share the variables id and sex, but d1 has 10 rows whereas d2 has 11. We can combine them into one data.frame with merge:

    > d <- merge(d1, d2, by="id", all=TRUE)
    > d
       id sex.x  tc sex.y  tg
    1   1   Nam 4.0   Nam 1.1
    2   2    Nu 3.5    Nu 2.1
    3   3    Nu 4.7    Nu 0.8
    4   4   Nam 7.7   Nam 1.1
    5   5   Nam 5.0   Nam 2.1
    6   6    Nu 4.2    Nu 1.5
    7   7   Nam 5.9   Nam 2.6
    8   8   Nam 6.1   Nam 1.5
    9   9   Nam 5.9   Nam 5.4
    10 10    Nu 4.0    Nu 1.9
    11 11  <NA>  NA    Nu 1.7

    In the merge command we ask R to combine the two datasets d1 and d2 into a new data.frame named d, using the variable id as the key. Patient 11 has no data for tc, so R reports NA (not available).

    5.4 Recoding data (data coding)

    In epidemiological data processing we often need to convert a continuous variable into a categorical one. In the diagnosis of osteoporosis, for example, women whose bone mineral density (BMD) T-score is at or below -2.5 are regarded as having osteoporosis, those with a BMD between -2.5 and -1.0 as having osteopenia, and those above -1.0 as normal. Suppose we have BMD values from 10 patients:

    -0.92, 0.21, 0.17, -3.21, -1.80, -2.60, -2.00, 1.71, 2.12, -2.11

    To enter these values into R we can use the function c, and then create a variable diagnosis coded 1 (osteoporosis), 2 (osteopenia) or 3 (normal):

    > bmd <- c(-0.92, 0.21, 0.17, -3.21, -1.80, -2.60, -2.00, 1.71, 2.12, -2.11)
    > diagnosis <- bmd
    > diagnosis[bmd <= -2.5] <- 1
    > diagnosis[bmd > -2.5 & bmd <= -1.0] <- 2
    > diagnosis[bmd > -1.0] <- 3
    > data <- data.frame(bmd, diagnosis)
    > data
         bmd diagnosis
    1  -0.92         3
    2   0.21         3
    3   0.17         3
    4  -3.21         1
    5  -1.80         2
    6  -2.60         1
    7  -2.00         2
    8   1.71         3
    9   2.12         3
    10 -2.11         2

    5.5 Recoding data with replace

    Another way of recoding is to use replace, although it looks a little more long-winded. Continuing the example above, we convert bmd into diagnosis as follows:

    > diagnosis <- bmd
    > diagnosis <- replace(diagnosis, bmd <= -2.5, 1)
    > diagnosis <- replace(diagnosis, bmd > -2.5 & bmd <= -1.0, 2)
    > diagnosis <- replace(diagnosis, bmd > -1.0, 3)

    5.6 Converting to a factor

    At this point diagnosis is still treated as a numeric variable. To turn it into a categorical variable (a factor) we use the function factor:

    > diag <- factor(diagnosis)
    > diag
    [1] 3 3 3 1 2 1 2 3 3 2
    Levels: 1 2 3

    Note that R now tells us that diag has 3 levels: 1, 2 and 3. If we ask R for the mean of diag, R does not comply, because diag is not a numeric variable:

    > mean(diag)
    [1] NA
    Warning message:
    argument is not numeric or logical: returning NA in: mean.default(diag)

    We can of course compute the mean of diagnosis:

    > mean(diagnosis)
    [1] 2.3

    but the result 2.3 has no practical meaning.

    5.7 Grouping data with cut2 (Hmisc)

    In statistical analysis we sometimes need to divide a continuous variable into several groups based on its distribution. For the variable bmd, for example, we can cut the values into groups of roughly equal size with the function cut2 (in the Hmisc library):

    > # load the Hmisc library in order to use the function cut2
    > library(Hmisc)
    > bmd <- c(-0.92, 0.21, 0.17, -3.21, -1.80, -2.60, -2.00, 1.71, 2.12, -2.11)
    > # divide bmd into 2 groups and store the result in the object group
    > group <- cut2(bmd, g=2)
    > table(group)
    group
    [-3.21,-0.92) [-0.92, 2.12]
                5             5

    As the example shows, g = 2 means dividing into 2 groups (g = group). R automatically forms group 1 from the bmd values from -3.21 up to (but not including) -0.92, and group 2 from -0.92 to 2.12. Each group contains 5 values. We can equally divide into 3 groups:

    > group <- cut2(bmd, g=3)
    > table(group)
    group
    [-3.21,-1.80) [-1.80, 0.21) [ 0.21, 2.12]
                4             3             3

    6. Using R for simple calculations

    One advantage of R is that it can be used like a pocket calculator. In fact it goes well beyond that: R can be used for matrix operations and for programming. In this chapter I present only a few simple calculations that school or university students can try immediately while reading these lines.

    6.1 Simple calculations

    Adding two or more numbers:
    > 15+2997
    [1] 3012

    Addition and subtraction:
    > 15+2997-9768
    [1] -6756

    Multiplication and division:
    > -27*12/21
    [1] -15.42857

    Powers: $(25 - 5)^3$
    > (25 - 5)^3
    [1] 8000

    Square root: $\sqrt{10}$
    > sqrt(10)
    [1] 3.162278

    The number pi ($\pi$):
    > pi
    [1] 3.141593
    > 2+3*pi
    [1] 11.42478

    Natural logarithm: $\log_e$
    > log(10)
    [1] 2.302585

    Base-10 logarithm: $\log_{10}$
    > log10(100)
    [1] 2

    Exponential: $e^{2.7689}$
    > exp(2.7689)
    [1] 15.94109
    > log10(2+3*pi)
    [1] 1.057848

    Trigonometric functions:
    > cos(pi)
    [1] -1

    Vectors:
    > x <- c(2,3,1,5,4,6,7,6,8)
    > x
    [1] 2 3 1 5 4 6 7 6 8
    > sum(x)
    [1] 42
    > x*2
    [1] 4 6 2 10 8 12 14 12 16
    > exp(x/10)
    [1] 1.221403 1.349859 1.105171 1.648721 1.491825 1.822119 2.013753 1.822119
    [9] 2.225541
    > exp(cos(x/10))
    [1] 2.664634 2.599545 2.704736 2.405079 2.511954 2.282647 2.148655 2.282647
    [9] 2.007132

    Sum of squares: $1^2 + 2^2 + 3^2 + 4^2 + 5^2 = ?$
    > x <- c(1,2,3,4,5)
    > sum(x^2)
    [1] 55

    Adjusted sum of squares: $\sum_{i=1}^{n}(x_i - \bar{x})^2 = ?$
    > x <- c(1,2,3,4,5)
    > sum((x-mean(x))^2)
    [1] 10
    In this formula mean(x) is the mean of the vector x.

    Mean square: $\sum_{i=1}^{n}(x_i - \bar{x})^2 / n = ?$
    > x <- c(1,2,3,4,5)
    > sum((x-mean(x))^2)/length(x)
    [1] 2
    In this formula length(x) is the total number of elements in the vector x.

    Variance and standard deviation:

    Variance: $s^2 = \sum_{i=1}^{n}(x_i - \bar{x})^2 / (n-1) = ?$
    > x <- c(1,2,3,4,5)
    > var(x)
    [1] 2.5
    Standard deviation: $\sqrt{s^2}$
    > sd(x)
    [1] 1.581139
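    The three formulas above differ only in their divisor, which can be checked by computing each quantity by hand and comparing it with R's built-in functions. This is a small sketch using the same vector; the object names ss, ms and s2 are arbitrary.

```r
x <- c(1, 2, 3, 4, 5)
n <- length(x)

ss <- sum((x - mean(x))^2)   # adjusted sum of squares
ms <- ss / n                 # mean square: divide by n
s2 <- ss / (n - 1)           # sample variance: divide by n - 1

# var() and sd() use the n - 1 divisor
all.equal(s2, var(x))
all.equal(sqrt(s2), sd(x))
```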

    6.2 Using R for matrix calculations

    A matrix, put simply, consists of rows and columns. When we write A[m, n] we mean that the matrix A has m rows and n columns. R expresses this in the same way. For example, suppose we want to create a square matrix A with 3 rows and 3 columns and the elements 1, 2, 3, 4, 5, 6, 7, 8, 9:

    $$A = \begin{pmatrix} 1 & 4 & 7 \\ 2 & 5 & 8 \\ 3 & 6 & 9 \end{pmatrix}$$

    In R:

    > y <- c(1,2,3,4,5,6,7,8,9)
    > A <- matrix(y, nrow=3)
    > A
         [,1] [,2] [,3]
    [1,]    1    4    7
    [2,]    2    5    8
    [3,]    3    6    9

    But if we type:

    > A <- matrix(y, nrow=3, byrow=TRUE)
    > A

    the result is:

         [,1] [,2] [,3]
    [1,]    1    2    3
    [2,]    4    5    6
    [3,]    7    8    9

    that is, the transposed matrix. Another way to obtain a transpose is to use t(). For example:

    > y <- c(1,2,3,4,5,6,7,8,9)
    > A <- matrix(y, nrow=3)
    > A
         [,1] [,2] [,3]
    [1,]    1    4    7
    [2,]    2    5    8
    [3,]    3    6    9

    and B = A' can be expressed in R as:

    > B <- t(A)
    > B
         [,1] [,2] [,3]
    [1,]    1    2    3
    [2,]    4    5    6
    [3,]    7    8    9

    A scalar matrix is a square matrix (the number of rows equals the number of columns) whose off-diagonal elements are all 0 and whose diagonal elements are all equal; when the diagonal elements are 1 it is the identity matrix. Such a matrix can be created in R as follows:

    > # create a 3 x 3 matrix with all elements equal to 0
    > A <- matrix(0, 3, 3)
    > # set the diagonal elements to 1
    > diag(A) <- 1
    > diag(A)
    [1] 1 1 1
    > # the matrix A is now:
    > A
         [,1] [,2] [,3]
    [1,]    1    0    0
    [2,]    0    1    0
    [3,]    0    0    1

    6.2.1 Extracting elements from a matrix

    > y <- c(1,2,3,4,5,6,7,8,9)
    > A <- matrix(y, nrow=3)
    > A
         [,1] [,2] [,3]
    [1,]    1    4    7
    [2,]    2    5    8
    [3,]    3    6    9
    > # column 1 of A
    > A[,1]
    [1] 1 2 3
    > # column 3 of A
    > A[,3]
    [1] 7 8 9
    > # row 1 of A
    > A[1,]
    [1] 1 4 7
    > # row 2, column 3 of A
    > A[2,3]
    [1] 8
    > # all rows of A except row 2
    > A[-2,]
         [,1] [,2] [,3]
    [1,]    1    4    7
    [2,]    3    6    9
    > # all columns of A except column 1
    > A[,-1]
         [,1] [,2]
    [1,]    4    7
    [2,]    5    8
    [3,]    6    9
    > # which elements are greater than 3
    > A>3
          [,1] [,2] [,3]
    [1,] FALSE TRUE TRUE
    [2,] FALSE TRUE TRUE
    [3,] FALSE TRUE TRUE

    6.2.2 Matrix arithmetic

    Adding and subtracting two matrices. Take two matrices A and B:

    > A <- matrix(1:12, 3, 4)
    > A
         [,1] [,2] [,3] [,4]
    [1,]    1    4    7   10
    [2,]    2    5    8   11
    [3,]    3    6    9   12
    > B <- -A
    > B
         [,1] [,2] [,3] [,4]
    [1,]   -1   -4   -7  -10
    [2,]   -2   -5   -8  -11
    [3,]   -3   -6   -9  -12

    We can compute A+B:

    > C <- A+B
    > C
         [,1] [,2] [,3] [,4]
    [1,]    0    0    0    0
    [2,]    0    0    0    0
    [3,]    0    0    0    0

    Or A-B:

    > D <- A-B
    > D
         [,1] [,2] [,3] [,4]
    [1,]    2    8   14   20
    [2,]    4   10   16   22
    [3,]    6   12   18   24

    Multiplying two matrices. Take the two matrices:

    $$A = \begin{pmatrix} 1 & 4 & 7 \\ 2 & 5 & 8 \\ 3 & 6 & 9 \end{pmatrix} \quad \text{and} \quad B = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{pmatrix}$$

    To compute AB in R we use %*%:

    > y <- c(1,2,3,4,5,6,7,8,9)
    > A <- matrix(y, nrow=3)
    > B <- t(A)
    > AB <- A%*%B
    > AB
         [,1] [,2] [,3]
    [1,]   66   78   90
    [2,]   78   93  108
    [3,]   90  108  126

    And to compute BA, again with %*%:

    > BA <- B%*%A
    > BA
         [,1] [,2] [,3]
    [1,]   14   32   50
    [2,]   32   77  122
    [3,]   50  122  194
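    Two points about matrix multiplication are easy to overlook: the order of the factors matters, and the ordinary * operator multiplies element by element rather than as a matrix product. The sketch below uses the same A and B; crossprod() and tcrossprod() are base R functions, and the names AB, BA and EW are arbitrary.

```r
A <- matrix(1:9, nrow = 3)   # columns are (1,2,3), (4,5,6), (7,8,9)
B <- t(A)                    # transpose of A

AB <- A %*% B                # matrix product A B
BA <- B %*% A                # matrix product B A, generally different from A B
EW <- A * B                  # element-wise product, not a matrix product

isTRUE(all.equal(AB, BA))    # FALSE: the two products differ

# Because B is t(A), the same products are available as
# tcrossprod(A), i.e. A %*% t(A), and crossprod(A), i.e. t(A) %*% A
all.equal(AB, tcrossprod(A))
all.equal(BA, crossprod(A))
```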

    Matrix inversion and solving systems of equations. Suppose we have the following system of equations:

    $$\begin{aligned} 3x_1 + 4x_2 &= 4 \\ x_1 + 6x_2 &= 2 \end{aligned}$$

    The system can be written in matrix notation as AX = Y, where:

    $$A = \begin{pmatrix} 3 & 4 \\ 1 & 6 \end{pmatrix}, \quad X = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}, \quad Y = \begin{pmatrix} 4 \\ 2 \end{pmatrix}$$

    The solution of the system is $X = A^{-1}Y$, or in R:

    > A <- matrix(c(3,1,4,6), nrow=2)
    > Y <- matrix(c(4,2), nrow=2)
    > X <- solve(A)%*%Y
    > X
              [,1]
    [1,] 1.1428571
    [2,] 0.1428571

    We can check:

    > 3*X[1,1]+4*X[2,1]
    [1] 4

    Eigenvalues can be computed with the function eigen:

    > eigen(A)
    $values
    [1] 7 2

    $vectors
               [,1]       [,2]
    [1,] -0.7071068 -0.9701425
    [2,] -0.7071068  0.2425356

    Determinant. How do we tell whether a matrix can be inverted? A matrix whose determinant is 0 is a singular matrix and cannot be inverted. To check the determinant, R provides det():

    > E <- matrix(c(1,2,3,4,5,6,7,8,9), nrow=3)
    > E
         [,1] [,2] [,3]
    [1,]    1    4    7
    [2,]    2    5    8
    [3,]    3    6    9
    > det(E)
    [1] 0

    The following matrix F, however, can be inverted:

    > F <- matrix(c(1,2,3,4,5,6,7,8,9)^2, nrow=3)
    > F
         [,1] [,2] [,3]
    [1,]    1   16   49
    [2,]    4   25   64
    [3,]    9   36   81
    > det(F)
    [1] -216

    The inverse of F ($F^{-1}$) can be computed with the function solve():

    > solve(F)
              [,1]      [,2]       [,3]
    [1,]  1.291667 -2.166667  0.9305556
    [2,] -1.166667  1.666667 -0.6111111
    [3,]  0.375000 -0.500000  0.1805556
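    As a complement, solve() can also be given the right-hand side directly, which solves AX = Y without forming the inverse explicitly, and an inverse can be verified by multiplying it back. This is a sketch only; the name Fm is used instead of F because F is also R's shorthand for FALSE.

```r
A <- matrix(c(3, 1, 4, 6), nrow = 2)   # filled by column: (3,1) then (4,6)
Y <- c(4, 2)

X1 <- solve(A) %*% Y    # via the explicit inverse, as in the text
X2 <- solve(A, Y)       # solves the system A X = Y directly

all.equal(as.vector(X1), X2)   # both routes give the same solution

# Check an inverse: Fm %*% solve(Fm) should be the identity matrix
Fm <- matrix((1:9)^2, nrow = 3)
check <- Fm %*% solve(Fm)
round(check, 10)
```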

    Beyond these simple operations, R can be used for other, more complex computations. A considerable advantage of R is that it leaves users free to build the computations that suit each specific problem. R has a package, Matrix, designed specifically for matrix computation. You can download the package, install it and use it if needed. The download address is: http://cran.au.r-project.org/bin/windows/contrib/r-release/Matrix_0.995-8.zip together with its documentation (about 80 pages): http://cran.au.r-project.org/doc/packages/Matrix.pdf.

    7. Using R for probability calculations

    7.1 Permutations

    We know that 3! = 3·2·1 = 6, and 0! = 1. In general, the factorial of a number n is $n! = n(n-1)(n-2)(n-3)\cdots 1$. In R this is very simple to compute with prod():

    Find 3!
    > prod(3:1)
    [1] 6

    Find 10!
    > prod(10:1)
    [1] 3628800
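    Base R also has a factorial() function, which agrees with the prod() approach above and handles 0! as well. In the sketch below, n_perm is a made-up helper name for the number of ordered selections of k items out of n, $n!/(n-k)!$.

```r
f10 <- factorial(10)     # same value as prod(10:1)
f0  <- factorial(0)      # 0! is 1

# Hypothetical helper: number of ordered selections of k items from n
n_perm <- function(n, k) factorial(n) / factorial(n - k)

p <- n_perm(10, 7)       # 10 * 9 * 8 * 7 * 6 * 5 * 4
p
```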

  • Phn tch s liu v biu bng R Nguyn Vn Tun

    32

    Find 10 × 9 × 8 × 7 × 6 × 5 × 4
    > prod(10:4)
    [1] 604800

    Find (10 × 9 × 8 × 7 × 6 × 5 × 4) / (40 × 39 × 38 × 37 × 36)
    > prod(10:4) / prod(40:36)
    [1] 0.007659481

    7.2 Combinations

    The number of ways of choosing k items from n elements is $\binom{n}{k} = \frac{n!}{k!(n-k)!}$. This is sometimes written $C_n^k$ instead of $\binom{n}{k}$. In R the calculation is very simple with the function choose(n, k). Here are a few examples:

    Find $\binom{5}{2}$
    > choose(5, 2)
    [1] 10

    Find the probability that the pair A and B, out of 5 people, is elected to two positions:
    > 1/choose(5, 2)
    [1] 0.1

    7.3 Random variables and distribution functions

    When we speak of a distribution, we refer to the values that a variable can take. Distribution functions describe variables in a systematic way, where "systematic" means following a specific mathematical model with given parameters. Probability and statistics offer many distribution functions; here we look at the most important and most commonly used ones: the binomial, the Poisson and the normal distribution. For each distribution there are four kinds of functions we need to know:

    the probability density function; the cumulative probability distribution function; the quantile function; and the simulation function.

    R has built-in functions for all of these that can be applied to probability calculations. The name of each function is formed from a prefix indicating the kind of function, followed by an abbreviation of the distribution's name. The prefixes are d (density, or probability), p (cumulative probability), q (quantile) and r (random, that is, random numbers). The abbreviated names are norm (normal distribution), binom (binomial distribution), pois (Poisson distribution), and so on. The following table summarizes the functions and parameters for each distribution:

    Distribution       Density                        Cumulative                     Quantile                       Simulation
    Normal             dnorm(x, mean, sd)             pnorm(q, mean, sd)             qnorm(p, mean, sd)             rnorm(n, mean, sd)
    Binomial           dbinom(x, size, prob)          pbinom(q, size, prob)          qbinom(p, size, prob)          rbinom(n, size, prob)
    Poisson            dpois(x, lambda)               ppois(q, lambda)               qpois(p, lambda)               rpois(n, lambda)
    Uniform            dunif(x, min, max)             punif(q, min, max)             qunif(p, min, max)             runif(n, min, max)
    Negative binomial  dnbinom(x, size, prob)         pnbinom(q, size, prob)         qnbinom(p, size, prob)         rnbinom(n, size, prob)
    Beta               dbeta(x, shape1, shape2)       pbeta(q, shape1, shape2)       qbeta(p, shape1, shape2)       rbeta(n, shape1, shape2)
    Gamma              dgamma(x, shape, rate, scale)  pgamma(q, shape, rate, scale)  qgamma(p, shape, rate, scale)  rgamma(n, shape, rate, scale)
    Geometric          dgeom(x, prob)                 pgeom(q, prob)                 qgeom(p, prob)                 rgeom(n, prob)
    Exponential        dexp(x, rate)                  pexp(q, rate)                  qexp(p, rate)                  rexp(n, rate)
    Weibull            dweibull(x, shape, scale)      pweibull(q, shape, scale)      qweibull(p, shape, scale)      rweibull(n, shape, scale)
    Cauchy             dcauchy(x, location, scale)    pcauchy(q, location, scale)    qcauchy(p, location, scale)    rcauchy(n, location, scale)
    F                  df(x, df1, df2)                pf(q, df1, df2)                qf(p, df1, df2)                rf(n, df1, df2)
    t                  dt(x, df)                      pt(q, df)                      qt(p, df)                      rt(n, df)
    Chi-squared        dchisq(x, df)                  pchisq(q, df)                  qchisq(p, df)                  rchisq(n, df)

    Note: in the table above, df = degrees of freedom; prob = probability; size = number of trials; n = sample size (the number of values to generate). The other parameters can be looked up for each distribution. The F, t and Chi-squared distributions also take one further parameter, the non-centrality parameter (ncp), which is treated as 0 by default. Users may supply another suitable value if needed.

    7.3.1 The binomial distribution

    As its name suggests, the binomial distribution concerns variables with only two values: male / female, alive / dead, yes / no, and so on. It is stated by the following theorem: if an experiment is carried out n times, each time resulting in either success or failure, and the probability of success p is known in advance, then the probability that k of the experiments succeed is:

    $$P(k \mid n, p) = C_n^k \, p^k (1-p)^{n-k},$$

    where k = 0, 1, 2, ..., n. In R, the function dbinom(k, n, p) evaluates this formula quickly. For instance, for k = 2 successes in n = 3 trials with p = 0.60, we simply enter:

    > dbinom(2, 3, 0.60)
    [1] 0.432

    Example 2: Cumulative binomial probability distribution. The probability that an anti-osteoporosis drug is effective is about 70% (that is, p = 0.70). If we treat 10 patients, what is the probability that at least 8 of them have a positive result? In other words, if X is the number of patients treated successfully, we need to find P(X ≥ 8). To answer this question we use the function pbinom(k, n, p). Recall that pbinom(k, n, p) gives P(X ≤ k). Hence P(X ≥ 8) = 1 − P(X ≤ 7), and the answer in R is:

    > 1-pbinom(7, 10, 0.70)
    [1] 0.3827828
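The relationship between the density function and the cumulative function in Example 2 can be verified directly. This is a minimal sketch using the same values (n = 10, p = 0.70); the lower.tail argument is a standard option of pbinom() for asking for the upper tail.

```r
n <- 10
p <- 0.70

# P(X >= 8) as the sum of the individual probabilities for k = 8, 9, 10
by_sum <- sum(dbinom(8:10, n, p))

# The same probability from the cumulative function, in two equivalent forms
by_complement <- 1 - pbinom(7, n, p)
by_upper_tail <- pbinom(7, n, p, lower.tail = FALSE)   # P(X > 7)

c(by_sum, by_complement, by_upper_tail)
```

All three values agree with the answer obtained in Example 2.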

    Example 3: Simulating a binomial distribution. Suppose that about 20% of the people in a population have hypertension. If we draw 1000 samples, each consisting of 20 people chosen at random from the population, how will the number of hypertensive patients be distributed? To answer this question we can use the function rbinom(n, size, prob) in R, where n is the number of samples, size the number of people in each sample and prob the prevalence:

    > b <- rbinom(1000, 20, 0.20)
    > table(b)
    b
      0   1   2   3   4   5   6   7   8   9  10
      6  45 147 192 229 169 105  68  23  13   3

    The first row (0, 1, 2, ..., 10) is the number of hypertensive patients among the 20 people selected. The second row tells us in how many of the 1000 samples that number occurred. So 6 samples contained no hypertensive patient, 45 samples contained exactly 1, and so on. Perhaps the easiest way to understand the result is to plot the frequencies with the command hist, as follows:

    > hist(b, main="Number of hypertensive patients")

    [Figure: histogram titled "Number of hypertensive patients"; horizontal axis b (0 to 10), vertical axis Frequency (0 to 200)]


    Figure 1. Distribution of the number of hypertensive patients among 20 people selected at random from a population in which 20% are hypertensive, with the sampling repeated 1000 times.

    The figure shows that samples containing 4 hypertensive patients (out of 20 people) are the most frequent (22.9%). This is understandable: because the prevalence of hypertension is 20%, we expect on average 4 of the 20 people selected to be hypertensive. The important point the figure makes, however, is that we sometimes observe as many as 10 hypertensive patients, even though the probability of such a sample is very low (only 3/1000).

    7.3.2 The Poisson distribution

    The Poisson distribution is, in general, very similar to the binomial distribution, except that the parameter p is usually very small and n is usually very large. For this reason the Poisson distribution is often used to describe events that occur very rarely (such as the number of people with cancer in a population). It has also been applied widely and successfully in engineering and market research, for example to the number of customers arriving at a restaurant each hour.

    Example 4: Poisson probability function. After many months of observation, the rate at which a secretary makes typing errors is known: on average, about 1 word is mistyped in every 2,000 words. What is the probability that the secretary mistypes exactly 2 words, or more than 2 words?

    Because the frequency is quite low, we can assume that the number of typing errors (call it X) is a random variable following the Poisson distribution. Here the mean number of errors is 1 (λ = 1). The Poisson law states that the probability that X = k, given the mean rate λ, is:

    $$P(X = k \mid \lambda) = \frac{e^{-\lambda}\lambda^{k}}{k!}$$

    The answer to the question above is therefore $P(X = 2 \mid \lambda = 1) = \frac{e^{-1} 1^{2}}{2!} = 0.1839$. This can be computed more quickly in R with the function dpois, as follows:

    > dpois(2, 1)
    [1] 0.1839397

    We can also compute the probability of 1 error and the probability of no error:

    > dpois(1, 1)
    [1] 0.3678794
    > dpois(0, 1)
    [1] 0.3678794

    Note that in dpois(2, 1) we simply supply the parameters k = 2 and λ = 1. The result is the probability that the secretary mistypes exactly 2 words. The probability of more than 2 errors (that is, 3, 4, 5, ... words) can be computed as:

    $$P(X > 2) = P(X = 3) + P(X = 4) + P(X = 5) + \dots = 1 - P(X \le 2) = 1 - 0.3678 - 0.3678 - 0.1839 = 0.08$$
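The statement that the Poisson distribution resembles a binomial distribution with a very small p and a very large n can be illustrated with the typing example. The sketch below assumes 2,000 typed words, each mistyped independently with probability 1/2000; that binomial model is an assumption made here for illustration.

```r
lambda <- 1

# Tail probability P(X > 2), built from the three individual terms
tail_prob <- 1 - sum(dpois(0:2, lambda))
tail_prob

# Probability of exactly 2 errors under each model
p_binom <- dbinom(2, size = 2000, prob = 1/2000)   # assumed binomial model
p_pois  <- dpois(2, lambda)
c(binomial = p_binom, poisson = p_pois)
```

The two probabilities of exactly 2 errors differ only from the fourth decimal place onward, which is why the simpler Poisson form is convenient for rare events.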

    In R, we can compute this as follows:

    # P(X ≤ 2)
    > ppois(2, 1)
    [1] 0.9196986
    # 1 - P(X ≤ 2)
    > 1-ppois(2, 1)
    [1] 0.0803014

    7.3.3 The normal distribution

    The two distributions we have just examined belong to the group that applies to non-continuous variables (discrete distributions), in which the variable takes ordinal or categorical values. For continuous variables there are several other suitable distributions, the most important of which is the normal distribution. The normal distribution is the most important foundation of statistical analysis; it is no exaggeration to say that most statistical theory is built on it. The normal density function has two parameters: the mean μ and the variance σ² (or the standard deviation σ). Let X be a variable (height, for instance); the normal density function states that the probability density at X = x is:

    $$P(X = x \mid \mu, \sigma^2) = f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right]$$

    Example 5: Normal probability density function.

    The current average height of Vietnamese women is 156 cm, with a standard deviation of 4.6 cm. Height is also known to follow the normal distribution. With the two parameters μ = 156 and σ = 4.6, we can construct a distribution function of height for the whole population of Vietnamese women, which has the following shape:

    [Figure: curve titled "Probability distribution of height in Vietnamese women"; horizontal axis Height (130 to 200), vertical axis f(height) (0.00 to 0.08)]

    Figure 2. Distribution of height in Vietnamese women, with mean 156 cm and standard deviation 4.6 cm. The horizontal axis is height and the vertical axis is the probability density for each height.

    The figure was drawn with the following two commands. The first creates a variable height with the values 130, 131, 132, ..., 200 cm. The second draws the curve for a mean of 156 cm and a standard deviation of 4.6 cm.

    > height <- seq(130, 200, 1)
    > plot(height, dnorm(height, 156, 4.6), type="l", ylab="f(height)", xlab="Height", main="Probability distribution of height in Vietnamese women")

    With these two parameters (and the figure), we can estimate the probability density for any height. For instance, the density for a Vietnamese woman with a height of 160 cm is:

    $$P(X = 160 \mid \mu = 156, \sigma = 4.6) = \frac{1}{4.6\sqrt{2 \times 3.1416}} \exp\left[-\frac{(160-156)^2}{2 \times 4.6^2}\right] = 0.0594$$

    The function dnorm(x, mean, sd) in R computes this value for us concisely:

    > dnorm(160, mean=156, sd=4.6)
    [1] 0.05942343
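To make the link between the formula and dnorm() explicit, the density can also be written out by hand. This is a minimal sketch with the same values as Example 5; the variable names mu, sigma and manual are chosen here for illustration.

```r
mu    <- 156
sigma <- 4.6
x     <- 160

# Normal density written out term by term from the formula
manual <- 1 / (sigma * sqrt(2 * pi)) * exp(-(x - mu)^2 / (2 * sigma^2))

# Built-in density function
builtin <- dnorm(x, mean = mu, sd = sigma)

c(manual = manual, builtin = builtin)
```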


    Cumulative normal probability function. Because height is a continuous variable, in practice we rarely want the probability of one specific value x; we usually want the probability of a range of values from a to b. For instance, we may want to know the probability that height lies between 150 and 160 cm, that is, P(150 ≤ X ≤ 160), or the probability that height is below 145 cm, that is, P(X < 145). To answer such questions we need the cumulative normal probability function, defined as follows:

    $$P(a \le X \le b) = \int_a^b f(x)\,dx$$

    Thus P(150 ≤ X ≤ 160) is the area under the curve in Figure 2 between 150 and 160 on the horizontal axis. R has the function pnorm(x, mean, sd), which computes the cumulative probability of a normal distribution and is very useful:

    $$\mathrm{pnorm}(a, \mathrm{mean}, \mathrm{sd}) = \int_{-\infty}^{a} f(x)\,dx = P(X \le a \mid \mathrm{mean}, \mathrm{sd})$$

    For instance, the probability that the height of a Vietnamese woman is 150 cm or less is 9.6%:

    > pnorm(150, 156, 4.6)
    [1] 0.0960575

    And the probability that the height of a Vietnamese woman is 164 cm or more is:

    > 1-pnorm(164, 156, 4.6)
    [1] 0.04100591

    In other words, only about 4.1% of Vietnamese women have a height of 164 cm or more.
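The two questions posed at the start of this passage, P(150 ≤ X ≤ 160) and P(X < 145), follow from pnorm() in the same way. The sketch below uses the parameters of Example 5; the call to integrate() is added here only as a numerical cross-check of the integral definition.

```r
mu    <- 156
sigma <- 4.6

# P(150 <= X <= 160) = P(X <= 160) - P(X <= 150)
p_interval <- pnorm(160, mu, sigma) - pnorm(150, mu, sigma)

# P(X < 145)
p_below <- pnorm(145, mu, sigma)

# Cross-check: integrate the density from 150 to 160
p_area <- integrate(dnorm, lower = 150, upper = 160, mean = mu, sd = sigma)$value

c(interval = p_interval, area = p_area, below145 = p_below)
```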

    Example 6: Applying the normal distribution. In a population, the mean blood pressure is known to be 100 mmHg with a standard deviation of 13 mmHg. What proportion of people in this population have a blood pressure of 120 mmHg or higher? The answer in R is:

    > 1-pnorm(120, mean=100, sd=13)
    [1] 0.0619679

    That is, about 6.2% of people in this population have a blood pressure of 120 mmHg or higher.

    7.3.4 The standardized normal distribution


    A variable X that follows the normal distribution with mean μ and variance σ² is usually abbreviated as:

    X ~ N(μ, σ²)

    Here μ and σ² depend on the unit in which the variable is measured. Height is measured in cm (or m), blood pressure in mmHg, age in years, and so on, so variables described in their original units are sometimes hard to compare. A simpler approach is to standardize X so that its mean is 0 and its variance is 1. After a few arithmetic steps, it is easy to show that the transformation of X that meets this condition is:

    $$Z = \frac{X - \mu}{\sigma}$$

    In mathematical terms: if X ~ N(μ, σ²), then (X − μ)/σ ~ N(0, 1). According to this formula, Z is simply the difference between a value and the mean, expressed as a number of standard deviations. If Z = 0, X equals the mean μ. If Z = -1, X is exactly 1 standard deviation below the mean. Similarly, if Z = 2.5, X is exactly 2.5 standard deviations above the mean, and so on. The distribution of height in Vietnamese women can be described in this new unit, the z score, as follows:

    [Figure: curve titled "Probability distribution of height in Vietnamese women"; horizontal axis z (-4 to 4), vertical axis f(z) (0.0 to 0.4)]

    Figure 3. Standardized distribution of height in Vietnamese women.

    The figure was drawn with the following two commands:

    > height <- seq(-4, 4, 0.1)
    > plot(height, dnorm(height, 0, 1), type="l", ylab="f(z)", xlab="z", main="Probability distribution of height in Vietnamese women")

    The standardized normal distribution has the convenience that it can be used to describe and compare the distribution of any variable, because all variables are converted to z scores. In the figure above, the vertical axis is the probability density of z and the horizontal axis is the variable z. With R we can easily compute the probability that z is less than some constant. For example, suppose we want to find P(z ≤ -1.96) for a distribution with mean 0 and standard deviation 1.

    > pnorm(-1.96, mean=0, sd=1)
    [1] 0.02499790

    Or P(z ≤ 1.96):

    > pnorm(1.96, mean=0, sd=1)
    [1] 0.9750021

    Hence P(-1.96 < z < 1.96) is:

    > pnorm(1.96) - pnorm(-1.96)
    [1] 0.9500042

    In other words, the probability that z lies between -1.96 and 1.96 is 95%. (Note that in the last command I did not supply mean=0, sd=1, because in pnorm the default value of the parameter mean is 0 and that of sd is 1.)
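Standardizing does not change any probability; it only changes the scale on which the question is asked. The following minimal sketch checks this for a height of 150 cm, using the parameters of Example 5 (the variable names are chosen here for illustration).

```r
mu    <- 156
sigma <- 4.6
x     <- 150

z <- (x - mu) / sigma                          # height expressed as a z score

p_original <- pnorm(x, mean = mu, sd = sigma)  # P(X <= 150) on the cm scale
p_standard <- pnorm(z)                         # P(Z <= z) on the standardized scale

c(z = z, original = p_original, standardized = p_standard)
```

Both calls return the same probability, about 9.6%, as computed earlier with pnorm(150, 156, 4.6).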

    Example 5 (continued). As a reminder, the mean height of Vietnamese women is 156 cm and the standard deviation is 4.6 cm. A woman with a height of 170 cm therefore has z = (170 − 156) / 4.6 = 3.04 standard deviations, and the proportion of Vietnamese women taller than 170 cm is very low, only about 0.1%.

    > 1-pnorm(3.04)
    [1] 0.001182891

    Finding the quantile of a normal distribution. Sometimes we need to do the calculation in reverse. For instance, we may want to know: if the probability that Z is less than some constant z equals a given value p, what is z? In probability notation, we want to find z such that:

    P(Z < z) = p

    To answer this question we use the function qnorm(p, mean=, sd=).

    Example 7: Given that Z ~ N(0, 1) and P(Z < z) = 0.95, we want to find z.

    > qnorm(0.95, mean=0, sd=1)
    [1] 1.644854

    Or P(Z < z) = 0.975 for a normal distribution with mean 0 and standard deviation 1:

    > qnorm(0.975, mean=0, sd=1)
    [1] 1.959964

    7.4 Random sampling

    In probability and statistics, random sampling is very important because it ensures the validity of the methods of statistical analysis and inference. In R, we can draw a random sample with the function sample.

    Example 8: We have a population of 40 people (coded 1, 2, 3, ..., 40). If we want to select 5 subjects from this population, who will be selected? We can use the command sample() to answer the question, as follows:

    > sample(1:40, 5)
    [1] 32 26  6 18  9

    The result says that subjects 32, 26, 6, 18 and 9 were selected. Each time the command is issued, R selects a different sample rather than repeating the one above. For example:

    > sample(1:40, 5)
    [1]  5 22 35 19  4
    > sample(1:40, 5)
    [1] 24 26 12  6 22
    > sample(1:40, 5)
    [1] 22 38 11  6 18

    and so on. The commands above perform random sampling without replacement, that is, once a subject has been drawn it is not returned to the population. Suppose instead that we want to sample with replacement (each subject drawn is put back into the population before the next draw). For example, to select 10 people from a population of 50 by random sampling with replacement, we only need to add the argument replace = TRUE:

    > sample(1:50, 10, replace=T)
     [1] 31 44  6  8 47 50 10 16 29 23

    Or toss a coin 10 times; each toss has, of course, two possible outcomes, H and T, and the result of the 10 tosses could be:

    > sample(c("H", "T"), 10, replace=T)
     [1] "H" "T" "H" "H" "H" "T" "H" "H" "T" "T"

    We can also imagine a bag containing 5 blue balls (X) and 5 red balls (D). We draw 1 ball, note its colour and put it back in the bag; then we draw another ball, note its colour and put it back again. Continuing in this way for 20 draws, the result could be:

    > sample(c("X", "D"), 20, replace=T)
     [1] "X" "D" "D" "D" "D" "D" "X" "X" "X" "X" "X" "D" "X" "X" "D" "X" "X" "X" "X"
    [20] "D"

    In addition, we can sample with given probabilities. In the following command we select 10 subjects from the sequence 1 to 5, with unequal probabilities:

    > sample(5, 10, prob=c(0.3, 0.4, 0.1, 0.1, 0.1), replace=T)
     [1] 3 1 3 2 2 2 2 2 5 1

    Subject 1 was selected 2 times, subject 2 was selected 5 times, subject 3 was selected 2 times, and so on. The result does not match the supplied probabilities 0.3, 0.4, 0.1 exactly because the sample is small, but it is not far from what we expect.

    8. Charts

    The R language offers many ways to design a neat and attractive chart. Most charting functions are built into R, but some more sophisticated and complex charts can be designed with specialized packages such as lattice or trellis, which can be downloaded from the R website. In this chapter I show how to draw the common charts using the popular functions in R.

    8.1 Data for graphical analysis

    Having looked at the environment and the options for designing a chart, we can now use some common functions to plot data. In my view, charts fall into 2 main kinds: charts that describe one variable, and charts that show the relationship between two or more variables. Of course, a variable may be continuous or non-continuous, so in practice we have 4 kinds of chart. In what follows I go through them from simple to complex.

    Perhaps the best way to learn how to draw charts in R is with a real dataset. I return to example 2 (section 4.2). In that example we have data consisting of 8 columns (or variables): id, sex, age, bmi, hdl, ldl, tc and tg. (Note: id is the code of the 50 study subjects; sex is gender (male or female); age is age; bmi is body mass index; hdl is high density cholesterol; ldl is low density cholesterol; tc is total cholesterol; and tg is triglycerides.) The data are stored in the directory c:\works\insulin under the name chol.txt. Before plotting, we start by reading the data into R.

    > setwd("c:/works/insulin")
    > cong <- read.table("chol.txt", header=TRUE)
    > attach(cong)

    Alternatively, to make the examples easier to follow, I enter the data with the following commands: sex [...]


    8.2 Chart for one discrete variable: barplot

    The variable sex in the data above has two values (Nam, male, and Nu, female), so it is a non-continuous variable. We want to know the frequency of each sex (how many men and how many women) and draw a simple chart. To do this, we first use the function table to obtain the frequencies:

    > sex.freq <- table(sex)
    > sex.freq
    sex
    Nam  Nu
     22  28

    There are 22 men and 28 women in the study. We then use the function barplot to display these frequencies, as follows:

    > barplot(sex.freq, main="Frequency of males and females")

    The same chart can also be obtained with a single, simpler command (Figure 8a):

    > barplot(table(sex), main="Frequency of males and females")

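    [Figure: two bar charts titled "Frequency of males and females" for the categories Nam and Nu, one with vertical bars and one with horizontal bars; frequency axis 0 to 25]

    Figure 8a. Frequency of each sex shown as vertical bars.

    Figure 8b. Frequency of each sex shown as horizontal bars.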

    Instead of showing the frequencies of men and women as 2 columns, we can show them as two horizontal bars with the argument horiz = TRUE, as follows (see the result in Figure 8b):

    > barplot(sex.freq, horiz = TRUE, col = rainbow(length(sex.freq)), main="Frequency of males and females")
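A bar chart can show percentages instead of counts. The sketch below rebuilds the variable sex from the counts reported above (22 Nam and 28 Nu) so that it runs without the data file; prop.table() is a base R function that converts a table of counts into proportions.

```r
# Rebuild the sex variable from the reported counts: 22 Nam, 28 Nu
sex <- rep(c("Nam", "Nu"), times = c(22, 28))

sex.freq <- table(sex)                  # counts per category
sex.pct  <- 100 * prop.table(sex.freq)  # percentages per category

barplot(sex.pct, ylab = "Percent", main = "Percentage of males and females")
```

Percentages are easier to compare than raw counts when groups from studies of different sizes are shown side by side.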


    8.3 Chart for two discrete variables: barplot

    Age is a continuous variable. We can divide the patients into several groups according to age. The function cut cuts a continuous variable into discrete groups. For instance:

    > ageg <- cut(age, 3)
    > table(ageg)
    ageg
      (42,54.7] (54.7,67.3]   (67.3,80]
             19          24           7

    This divides the variable age into 3 groups: 42 to 54.7 years is group 1, 54.7 to 67.3 years is group 2, and 67.3 to 80 years is group 3. Group 1 has 19 patients, and groups 2 and 3 have 24 and 7 patients. Now we want to know how many patients there are in each age group and each sex, using the command table:

    > age.sex <- table(sex, ageg)
    > age.sex
         ageg
    sex   (42,54.7] (54.7,67.3] (67.3,80]
      Nam        10          10         2
      Nu          9          14         5

    The result shows 10 male and 9 female patients in the first age group, 10 men and 14 women in the second age group, and so on. To display the frequencies of these two variables we again use barplot:

    > barplot(age.sex, main="Number of males and females in each age group")

    [Figure: two bar charts of the age groups (42,54.7], (54.7,67.3] and (67.3,80]; the first, titled "Number of males and females in each age group", with stacked bars; the second, with horizontal axis label "Age group", with the bars for each sex side by side]

    Figure 9a. Frequency of sex and age group shown as stacked columns.

    Figure 9b. Frequency of sex and age group shown as two columns side by side.

    In Figure 9a each column represents an age group; the dark part of the column is women and the light part is the frequency of men. Instead of stacking the frequencies of men and women in one column, we can show them as 2 columns with beside=TRUE, as follows (Figure 9b):

    > barplot(age.sex, beside=TRUE, xlab="Age group")

    8.4 Pie charts

    The frequencies of a discrete variable can also be shown as a pie chart. The following example draws the frequencies of age. Figure 10a shows 3 age groups, and Figure 10b shows the frequencies for 5 age groups:

    > pie(table(ageg))
    > pie(table(cut(age,5)))

    [Figure: two pie charts of age groups; the first with sectors (42,54.7], (54.7,67.3] and (67.3,80]; the second with sectors (42,49.6], (49.6,57.2], (57.2,64.8], (64.8,72.4] and (72.4,80]]

    Figure 10a. Frequencies for 3 age groups. Figure 10b. Frequencies for 5 age groups.

    8.5 Charts for one continuous variable: stripchart and hist

    8.5.1 Stripchart

    A strip chart shows us how continuous a variable is. For instance, if we want to examine the continuity of triglyceride (tg), the function stripchart() helps us do so:

    > stripchart(tg, main="Strip chart for triglycerides", xlab="mg/L")

    [Figure: strip chart titled "Strip chart for triglycerides"; horizontal axis mg/L (1 to 6)]

    We see that the variable tg shows some discontinuity, especially among subjects with high tg. While most subjects have a tg below 5, there are 2 subjects with a very high tg (>5).

    8.5.2 Histogram

    Age is a continuous variable. To draw the frequency chart of the variable age, we simply enter hist(age). As mentioned above, we can improve the chart by adding a main title (main) and labels for the horizontal axis (xlab) and the vertical axis (ylab):

    > hist(age)
    > hist(age, main="Frequency distribution by age group", xlab="Age group", ylab="No of patients")

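    [Figure: two histograms of age (40 to 80, counts 0 to 12); the first titled "Histogram of age" with axes age and Frequency; the second titled "Frequency distribution by age group" with axes "Age group" and "No of patients"]

    Figure 11a. The vertical axis is the number of patients (study subjects) and the horizontal axis is age. For instance, there are 6 patients aged 40 to 45 and 4 patients aged 70 to 80.

    Figure 11b. The chart title and the names of the vertical and horizontal axes are added with main, xlab and ylab.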

    We can also turn the chart into a probability density plot with the function plot(density()), as follows (result in Figure 12a):

    > plot(density(age))

    [Figure: two density plots of age; the first titled "density.default(x = age)" with N = 50 and Bandwidth = 3.806, horizontal axis 30 to 90; the second titled "Histogram of age" with axes age (40 to 80) and Density (0.00 to 0.04)]

    Figure 12a. Probability density distribution of the variable age.

    Figure 12b. Probability density distribution of the variable age, drawn together with the histogram.
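A histogram with a density curve drawn over it, of the kind shown in Figure 12b, can be sketched as follows. The ages here are simulated (the vector age.sim, its mean and its standard deviation are hypothetical values chosen for illustration), and the two calls are one common way of overlaying the plots rather than a reconstruction of the exact commands behind the figure.

```r
set.seed(1)                                        # make the simulated data reproducible
age.sim <- round(rnorm(50, mean = 57, sd = 10))    # hypothetical ages, not the study data

# freq = FALSE puts the histogram on the density scale,
# so the density curve can be drawn on the same axes
hist(age.sim, freq = FALSE, main = "Histogram of age with density curve", xlab = "age")
lines(density(age.sim), col = "red")
```

The key step is freq = FALSE: with the default count scale the density curve would be almost invisible along the bottom of the plot.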

    We can also draw the two plots on top of each other (see Figure 12b for the result).

    8.6 Box plots

    To draw a box plot of the variable tc, we simply enter:

    > boxplot(tc, main="Box plot of total cholesterol", ylab="mg/L")

    [Figure: box plot titled "Box plot of total cholesterol"; vertical axis mg/L (3 to 8)]

    Figure 13. In this chart, the median is about 5.6 mg/L, 25% of the total cholesterol values are below 4.1, and 75% are below 6.2. The lowest total cholesterol is about 3 and the highest is above 8 mg/L.

    In the following chart we compare tc between men and women:

    > boxplot(tc ~ sex, main="Box plot of total cholesterol by sex", ylab="mg/L")

    The result is shown in Figure 14a. We can change the appearance of the chart with the argument horizontal=TRUE and change the colour with the argument col, as follows (Figure 14b):

    > boxplot(tc~sex, horizontal=TRUE, main="Box plot of total cholesterol", ylab="mg/L", col = "pink")

    [Figure: two box plots of total cholesterol (mg/L, 3 to 8) for the groups Nam and Nu; the first titled "Box plot of total cholesterol by sex" with vertical boxes; the second titled "Box plot of total cholesterol" with horizontal boxes]

    Figure 14a. In this chart, the median total cholesterol of women is lower than that of men, but the spread in the two groups does not differ much.

    Figure 14b. Total cholesterol for each sex, with colour and horizontal boxes.

    8.7 Graphical analysis for two continuous variables

    8.7.1 Scatter plot

    To explore the relationship between two variables we use a scatter plot. To draw a scatter plot of the relationship between the variables tc and hdl, we use the function plot. The first argument of plot is the horizontal axis (x-axis) and the second argument is the vertical axis. To explore the relationship between tc and hdl we simply enter:

    > plot(tc, hdl)

    [Figure: scatter plot with horizontal axis tc (3 to 8) and vertical axis hdl (2 to 8)]

    Figure 15. Relationship between tc and hdl. In this chart the variable hdl is drawn on the vertical axis and tc on the horizontal axis.

    We want to distinguish the sexes (men and women) in the chart above. To do so we use the function ifelse. In the following command, if sex=="Nam" the point is drawn with symbol 16 (a filled circle), otherwise with symbol 22 (a square):

    > plot(hdl, tc, pch=ifelse(sex=="Nam", 16, 22))

    The result is Figure 16a. We can also replace the symbols with the letters M (male) and F (female) (see Figure 16b):

    > plot(hdl, tc, pch=ifelse(sex=="Nam", "M", "F"))

    [Figure: two scatter plots of tc and hdl by sex; the first with horizontal axis tc (3 to 8) and vertical axis hdl (2 to 8); the second with horizontal axis hdl (2 to 8) and vertical axis tc (3 to 8), the points drawn as the letters M and F]

    Figure 16a. Relationship between tc and hdl by sex, shown with two plotting symbols.

    Figure 16b. Relationship between tc and hdl by sex, shown with two letters.

    We can also draw a linear regression line through the points by continuing with the following commands:

    > plot(hdl ~ tc, pch=16, main="Total cholesterol and HDL cholesterol", xlab="Total cholesterol", ylab="HDL cholesterol", bty="l")
    > reg <- lm(hdl ~ tc)
    > abline(reg)

    The result is Figure 17a below. We can also use a smooth function to show the relationship between the two variables. The following chart uses lowess (one of the most commonly used smoothers) to smooth the tc and hdl data (Figure 17b).

    > plot(hdl ~ tc, pch=16, main="Total cholesterol and HDL cholesterol with LOWESS smooth function", xlab="Total cholesterol", ylab="HDL cholesterol", bty="l")
    > lines(lowess(tc, hdl, f=2/3, iter=3), col="red")

    [Figure: two scatter plots of HDL cholesterol (2 to 8) against Total cholesterol (3 to 8); the first titled "Total cholesterol and HDL cholesterol"; the second titled "Total cholesterol and HDL cholesterol with LOWESS smooth function"]

    Figure 17a. In the commands above, reg <- lm(hdl ~ tc) [...]

    8.8 Graphical analysis for several variables: pairs

    > lipid <- data.frame(age, bmi, hdl, ldl, tc)
    > pairs(lipid, pch=16)

    The result is:

    [Figure: scatter plot matrix of the variables age, bmi, hdl, ldl and tc]

    8.9 Charts with standard errors

    In the following chart we have 5 groups (the variable x is simulated, not real data), and each group has a mean value (mean) and a 95% confidence interval (lcl and ucl). Usually lcl = mean - 1.96*SE and ucl = mean + 1.96*SE (SE is the standard error). We want to draw a chart for the 5 groups with these error bars. The following commands and functions are needed:

    > group <- [...]
    > mean <- [...]
    > lcl <- [...]
    > ucl <- [...]
    > plot(group, mean, ylim=range(c(lcl, ucl)))
    > arrows(group, ucl, group, lcl, length=0.5, angle=90, code=3)

    [Figure: mean (1 to 5) plotted against group (1 to 5), with error bars]
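An error-bar chart of this kind can be reproduced with made-up numbers. In the sketch below the group means (mean.y) and standard errors (se) are hypothetical values chosen for illustration; only the use of plot() and arrows() follows the text.

```r
group  <- 1:5
mean.y <- c(1.1, 2.3, 3.0, 3.9, 5.1)    # hypothetical group means
se     <- c(0.3, 0.4, 0.2, 0.5, 0.3)    # hypothetical standard errors

# 95% confidence limits: mean -/+ 1.96 standard errors
lcl <- mean.y - 1.96 * se
ucl <- mean.y + 1.96 * se

# Leave room for the whole interval, then draw bars with flat caps at both ends
plot(group, mean.y, pch = 16, ylim = range(c(lcl, ucl)), ylab = "mean")
arrows(group, ucl, group, lcl, length = 0.05, angle = 90, code = 3)
```

With angle = 90 and code = 3, arrows() draws a flat cap at both ends of each segment; length sets the cap width in inches.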

    9. Descriptive statistical analysis

    9.1 Descriptive statistics (summary)

    To illustrate the use of R for descriptive statistics, I use a research dataset named igfdata. In this study, besides the measures related to sex, age, weight and height, we measured hormones related to growth, such as igfi, igfbp3 and als, and markers of bone turnover: pinp, ictp and p3np. There are 100 study subjects. The data are stored in the directory c:\works\stats. First we need to read the data into R with the following commands (the text after the # sign is a comment for the reader):

    > options(width=100)
    # change directory
    > setwd("c:/works/stats")
    # read the data into R
    > igfdata <- read.table([...])
    > attach(igfdata)
    # look at the columns in the data
    > names(igfdata)
     [1] "id"        "sex"       "age"       "weight"    "height"    "ethnicity"
     [7] "igfi"      "igfbp3"    "als"       "pinp"      "ictp"      "p3np"
    > igfdata
         id    sex age weight height ethnicity    igfi  igfbp3     als    pinp    ictp    p3np
    1     1 Female  15     42    162     Asian 189.000 4.00000 323.667 353.970 11.2867  8.3367
    2     2   Male  16     44    160 Caucasian 160.000 3.75000 333.750 375.885 10.4300  6.7450
    3     3 Female  15     43    157     Asian 146.833 3.43333 248.333 199.507  8.3633 12.5000
    4     4 Female  15     42    155     Asian 185.500 3.40000 251.000 483.607 13.3300 14.2767
    5     5 Female  16     47    167     Asian 192.333 4.23333 322.000 105.430  7.9233  4.5033
    6     6 Female  25     45    160     Asian 110.000 3.50000 284.667  76.487  4.9833  4.9367
    7     7 Female  19     45    161     Asian 157.000 3.20000 274.000  75.880  6.3500  5.3200
    8     8 Female  18     43    153     Asian 146.000 3.40000 303.000  86.360  7.3700  4.6700
    9     9 Female  15     41    149     Asian 197.667 3.56667 308.500 254.803 11.8700  6.8200
    10   10 Female  24     45    157   African 148.000 3.40000 273.000  44.720  3.7400  6.1600
    ...
    97   97 Female  17     54    168 Caucasian 204.667 4.96667 441.333  64.130  5.1600  4.4367
    98   98   Male  18     55    169     Asian 178.667 3.86667 273.000 185.913  7.5267  8.8333
    99   99 Female  18     48    151     Asian 237.000 3.46667 324.333 105.127  5.9867  5.6600
    100 100   Male  15     54    168     Asian 130.000 2.70000 259.333 325.840 10.2767  6.5933

    The listing above shows only part of the data for the 100 subjects. For a variable $x_1, x_2, x_3, \ldots, x_n$ we can compute a number of descriptive statistics, as follows:

    Theory                                                                      R function
    Mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$                             mean(x)
    Variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$             var(x)
    Standard deviation: $s = \sqrt{s^2}$                                        sd(x)
    Standard error: $SE = \frac{s}{\sqrt{n}}$                                   none
    Minimum                                                                     min(x)
    Maximum                                                                     max(x)
    Range                                                                       range(x)

    Example 9: To find the mean age, we simply enter:

    > mean(age)
    [1] 19.17

    Or the variance and the standard deviation of age:

    > var(age)
    [1] 15.33444
    > sd(age)
    [1] 3.915922


    R also has the command summary, which gives us the main descriptive statistics of a variable at once:

    > summary(age)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      13.00   16.00   19.00   19.17   21.25   34.00

    In general this output is simple and the abbreviations are easy to understand. Note that the two statistics 1st Qu and 3rd Qu stand for the first quartile (the 25% position) and the third quartile (the 75% position) of a variable. First quartile = 16 means that 25% of the study subjects are aged 16 or younger. Similarly, third quartile = 21.25 means that 75% of the subjects are aged 21.25 or younger. The median of 19 means, of course, that 50% of the subjects are aged 19 or younger (or 19 or older). R has no function for the standard error, and the function summary does not report the standard deviation either. To obtain these values we can write a simple function ourselves (call it desc), as follows: desc [...]
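A helper of the kind described, returning the standard deviation and the standard error alongside the usual summaries, might look like the sketch below. The body of the function is an illustration written for this purpose, not the original definition of desc; the test values are the ages of the first ten subjects in the listing shown earlier.

```r
# Illustrative helper: n, mean, SD, standard error, minimum and maximum
desc <- function(x) {
  x <- x[!is.na(x)]              # drop missing values first
  n <- length(x)
  s <- sd(x)
  c(n = n, mean = mean(x), sd = s, se = s / sqrt(n),
    min = min(x), max = max(x))
}

# Ages of the first ten subjects in the igfdata listing
desc(c(15, 16, 15, 15, 16, 25, 19, 18, 15, 24))
```

Returning a named vector keeps the output compact and lets individual values be picked out by name, for example desc(age)["se"].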

          igfi            igfbp3           als             pinp             ictp
     Min.   : 85.71   Min.   :2.000   Min.   :192.7   Min.   : 26.74   Min.   : 2.697
     1st Qu.:137.17   1st Qu.:3.292   1st Qu.:256.8   1st Qu.: 68.10   1st Qu.: 4.878
     Median :161.50   Median :3.550   Median :292.5   Median :103.26   Median : 6.338
     Mean   :165.59   Mean   :3.617   Mean   :301.8   Mean   :167.17   Mean   : 7.420
     3rd Qu.:186.46   3rd Qu.:3.875   3rd Qu.:331.2   3rd Qu.:196.45   3rd Qu.: 8.423
     Max.   :427.00   Max.   :5.233   Max.   :471.7   Max.   :742.68   Max.   :21.237
          p3np
     Min.   : 2.343
     1st Qu.: 4.433
     Median : 5.445
     Mean   : 6.341
     3rd Qu.: 7.150
     Max.   :16.303

    R summarizes every variable for which a summary can be computed! So even the column id (the code of each study subject) is summarized, although we know that the results for id have no statistical meaning. For categorical variables such as sex and ethnicity, R only reports the frequency of each group.

    The results above are for all study subjects. If we want separate results for men and women, the function by in R is very useful. In the following command we ask R to summarize the data igfdata by sex.

    > by(igfdata, sex, summary)
    sex: Female
           id           sex          age            weight          height
     Min.   : 1.0   Female:69   Min.   :13.00   Min.   :41.00   Min.   :149.0
     1st Qu.:21.0   Male  : 0   1st Qu.:17.00   1st Qu.:47.00   1st Qu.:156.0
     Median :47.0               Median :19.00   Median :50.00   Median :162.0
     Mean   :48.2               Mean   :19.59   Mean   :49.35   Mean   :161.9
     3rd Qu.:75.0               3rd Qu.:22.00   3rd Qu.:52.00   3rd Qu.:166.0
     Max.   :99.0               Max.   :34.00   Max.   :60.00   Max.   :196.0
         ethnicity       igfi            igfbp3           als
     African  : 4   Min.   : 85.71   Min.   :2.767   Min.   :204.3
     Asian    :43   1st Qu.:136.67   1st Qu.:3.333   1st Qu.:263.8
     Caucasian:22   Median :163.33   Median :3.567   Median :302.7
     Others   : 0   Mean   :167.97   Mean   :3.695   Mean   :311.5
                    3rd Qu.:186.17   3rd Qu.:3.933   3rd Qu.:361.7
                    Max.   :427.00   Max.   :5.233   Max.   :471.7
          pinp             ictp             p3np
     Min.   : 26.74   Min.   : 2.697   Min.   : 2.343
     1st Qu.: 62.75   1st Qu.: 4.717   1st Qu.: 4.337
     Median : 78.50   Median : 5.537   Median : 5.143
     Mean   :108.74   Mean   : 6.183   Mean   : 5.643
     3rd Qu.:115.26   3rd Qu.: 7.320   3rd Qu.: 6.143
     Max.   :502.05   Max.   :13.633   Max.   :14.420
    ------------------------------------------------------------
    sex: Male
           id             sex          age            weight          height
     Min.   :  2.00   Female: 0   Min.   :14.00   Min.   :44.00   Min.   :155.0
     1st Qu.: 34.50   Male  :31   1st Qu.:15.00   1st Qu.:48.50   1st Qu.:161.5
     Median : 56.00               Median :17.00   Median :51.00   Median :164.0
     Mean   : 55.61               Mean   :18.23   Mean   :51.16   Mean   :165.6
     3rd Qu.: 75.00               3rd Qu.:20.00   3rd Qu.:53.50   3rd Qu.:169.0
     Max.   :100.00               Max.   :27.00   Max.   :59.00   Max.   :191.0
         ethnicity       igfi            igfbp3           als
     African  : 4   Min.   : 94.67   Min.   :2.000   Min.   :192.7
     Asian    :17   1st Qu.:138.67   1st Qu.:3.183   1st Qu.:249.8
     Caucasian: 8   Median :160.00   Median :3.500   Median :276.0
     Others   : 2   Mean   :160.29   Mean   :3.443   Mean   :280.2
                    3rd Qu.:183.00   3rd Qu.:3.775   3rd Qu.:311.3
                    Max.   :274.00   Max.   :4.500   Max.   :388.7
          pinp             ictp             p3np
     Min.   : 56.28   Min.   : 3.650   Min.   : 3.390
     1st Qu.:135.07   1st Qu.: 6.900   1st Qu.: 5.375
     Median :245.92   Median : 9.513   Median : 7.140
     Mean   :297.21   Mean   :10.173   Mean   : 7.895
     3rd Qu.:450.38   3rd Qu.:13.517   3rd Qu.:10.010
     Max.   :742.68   Max.   :21.237   Max.   :16.303

    To look at the distributions of the hormones and biochemical markers all at once, we can plot all six variables together. First we split the screen into 6 panels (2 rows and 3 columns), then draw each histogram in turn:

    > op <- par(mfrow=c(2,3))
    > hist(igfi)
    > hist(igfbp3)
    > hist(als)
    > hist(pinp)
    > hist(ictp)
    > hist(p3np)


    [Figure: six histograms in a 2 x 3 grid, titled "Histogram of igfi", "Histogram of igfbp3", "Histogram of als", "Histogram of pinp", "Histogram of ictp" and "Histogram of p3np"; x-axis: the variable, y-axis: Frequency.]
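The multi-panel layout used for these histograms can be sketched in a self-contained way as follows. Because igfdata is not reproduced here, the two vectors are simulated stand-ins (the names marker_a and marker_b and their values are hypothetical). The sketch also restores the previous layout with par(op), a step the text does not show but which is presumably why the old settings were saved in op.

```r
# Simulated, right-skewed stand-ins for two lab measurements (hypothetical data)
set.seed(1)
marker_a <- rlnorm(100, meanlog = 5.0, sdlog = 0.3)
marker_b <- rlnorm(100, meanlog = 1.8, sdlog = 0.4)

# par() returns the settings it replaces, so they can be restored later
op <- par(mfrow = c(1, 2))   # 1 row, 2 columns
hist(marker_a)
hist(marker_b)
par(op)                      # back to a single plotting panel
```

Without the final par(op), every later plot in the same session would keep being drawn into the multi-panel grid.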

    9.2 Descriptive statistics by group

    To compute the mean of a variable such as igfi separately for men and women, we can use the tapply function:

    > tapply(igfi, list(sex), mean)
      Female     Male
    167.9741 160.2903

    In this command, igfi is the variable to summarize, sex is the grouping variable, and mean is the statistic we want. The result shows that mean igfi is higher in women (167.97) than in men (160.29). To compute the mean for each combination of sex and ethnicity, we only need to add one more variable to list:

    > tapply(igfi, list(ethnicity, sex), mean)
                Female     Male
    African   145.1252 120.9168
    Asian     165.6589 160.4999
    Caucasian 176.6536 169.4790
    Others          NA 200.5000

    In this output, NA means "not available": there are no data for women in the "Others" ethnic group.

    9.3 The t-test (t.test)

    The t-test rests on the assumption of a normal distribution. There are two kinds of t-test: the one-sample t-test and the two-sample t-test. The one-sample t-test asks whether the mean of a sample is really equal to some specified value. The two-sample t-test asks whether two samples come from the same distribution or, more specifically, whether the two samples really have the same mean. I will illustrate both tests in turn with the igfdata data above.

    9.3.1 One-sample t-test

    Example 10. The analysis above shows that the mean age of the 100 subjects in this study is 19.17 years. Suppose we know from earlier work that the mean age in this population is 30 years. The question is whether our sample is representative of the population. In other words, we want to know whether the sample mean of 19.17 really differs from the population mean of 30. To answer this question we use the t-test. According to statistical theory, the t statistic is defined by the following formula:

    $$ t = \frac{\bar{x} - \mu}{s / \sqrt{n}} $$

    Here $\bar{x}$ is the sample mean, $\mu$ is the hypothesized mean (30 in this case), $s$ is the standard deviation, and $n$ is the sample size (100). If the absolute value of t exceeds the theoretical value of the t distribution at a chosen significance level, 5% for example, we have grounds to say that the difference is statistically significant. For a sample of 100, this critical value can be computed with the qt function in R:

    > qt(0.95, 100)
    [1] 1.660234

    (Strictly speaking, the one-sample test has n - 1 = 99 degrees of freedom, and a two-sided test at the 5% level uses the 0.975 quantile, which is close to 1.98.) A quicker way to answer the question is to use the t.test function:

    > t.test(age, mu=30)

            One Sample t-test

    data:  age
    t = -27.6563, df = 99, p-value < 2.2e-16
    alternative hypothesis: true mean is not equal to 30
    95 percent confidence interval:
     18.39300 19.94700
    sample estimates:
    mean of x
        19.17

    In this command, age is the variable to be tested and mu=30 is the hypothesized value. R reports t = -27.66 with 99 degrees of freedom and p < 2.2e-16 (that is, extremely small). R also reports that the 95% confidence interval for mean age is 18.4 to 19.9 years; 30 years lies far outside this interval. In other words, we have grounds to say that the mean age in this sample is truly lower than the mean age of the population.

    9.3.2 Two-sample t-test

    Example 11. The descriptive analysis above (the summary section) shows that women have higher igfi levels than men (167.97 versus 160.29). The question is whether this is a systematic difference or one caused by chance. To answer it, we need to consider the mean difference between the two groups and the standard deviation of that difference.

    $$ t = \frac{\bar{x}_2 - \bar{x}_1}{SED} $$

    Here $\bar{x}_1$ and $\bar{x}_2$ are the means of the two groups (men and women), and SED is the standard deviation of the difference $(\bar{x}_1 - \bar{x}_2)$. In practice, SED can be estimated by the formula:

    $$ SED = \sqrt{SE_1^2 + SE_2^2} $$

    Here $SE_1$ and $SE_2$ are the standard errors of the two groups. According to probability theory, t follows a t distribution with $n_1 + n_2 - 2$ degrees of freedom, where $n_1$ and $n_2$ are the sample sizes of the two groups. We can use R to answer the question with the t.test function:

    > t.test(igfi~ sex)

            Welch Two Sample t-test

    data:  igfi by sex
    t = 0.8412, df = 88.329, p-value = 0.4025
    alternative hypothesis: true difference in means is not equal to 0
    95 percent confidence interval:
     -10.46855  25.83627
    sample estimates:
    mean in group Female   mean in group Male
                167.9741             160.2903


    R reports the most important values first:

    t = 0.8412, df = 88.329, p-value = 0.4025

    df is the degrees of freedom. The p-value of 0.4025 shows that the difference between men and women is not statistically significant (because it is greater than 0.05, or 5%).

    95 percent confidence interval: -10.46855 25.83627

    This is the 95% confidence interval for the difference between the two groups. It indicates that igfi in women could be as much as 10.5 ng/L lower than in men, or about 25.8 ng/L higher. Because the interval is this wide and includes zero, it is further evidence that there is no statistically significant difference between the two groups.
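To make the SED formula concrete, the sketch below recomputes the two-sample statistic by hand and compares it with t.test. The two vectors are simulated, since igfdata is not available here; the names grp1 and grp2 and the chosen means and standard deviations are illustrative assumptions only.

```r
# Simulated measurements for two independent groups (hypothetical data)
set.seed(42)
grp1 <- rnorm(69, mean = 168, sd = 45)
grp2 <- rnorm(31, mean = 160, sd = 35)

# Standard error of each group mean
se1 <- sd(grp1) / sqrt(length(grp1))
se2 <- sd(grp2) / sqrt(length(grp2))

# Standard error of the difference, then the t statistic.
# t.test(x, y) reports mean(x) - mean(y), so the same order is used here.
sed    <- sqrt(se1^2 + se2^2)
t_hand <- (mean(grp1) - mean(grp2)) / sed

# Default t.test() (var.equal = FALSE) is the Welch test
welch <- t.test(grp1, grp2)
c(by_hand = t_hand, t.test = unname(welch$statistic))
```

The two statistics agree. The degrees of freedom differ from n1 + n2 - 2, however: when the variances are not assumed equal, t.test uses the Welch-Satterthwaite approximation, which is why the output for igfi shows df = 88.329 rather than 98.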

    The test above assumes that the two groups have different variances. If we have reason to believe that the two groups share the same variance, we only need to change one argument of t.test, var.equal=TRUE:

    > t.test(igfi~ sex, var.equal=TRUE)

            Two Sample t-test

    data:  igfi by sex
    t = 0.7071, df = 98, p-value = 0.4812
    alternative hypothesis: true difference in means is not equal to 0
    95 percent confidence interval:
     -13.88137  29.24909
    sample estimates:
    mean in group Female   mean in group Male
                167.9741             160.2903

    Numerically, these results differ slightly from those obtained under the assumption of unequal variances, but the p-value leads to the same conclusion: the difference between the two groups is not statistically significant.

    9.4 Wilcoxon test for two samples (wilcox.test)

    The t-test assumes that the variable follows a normal distribution. If this assumption does not hold, the result of the t-test may not be valid. To test the distribution of igfi, we can use the shapiro.test function:

    > shapiro.test(igfi)

            Shapiro-Wilk normality test

    data:  igfi
    W = 0.8528, p-value = 1.504e-08

    The p-value is far below 0.05, so we can say that igfi does not follow a normal distribution. In this situation, the comparison between the two groups can rely on a non-parametric method called the Wilcoxon test, because this test (unlike the t-test) does not depend on the normality assumption.

    > wilcox.test(igfi ~ sex)

            Wilcoxon rank sum test with continuity correction

    data:  igfi by sex
    W = 1125, p-value = 0.6819
    alternative hypothesis: true mu is not equal to 0

    The p-value of 0.682 shows that the difference in igfi between men and women is indeed not statistically significant. This conclusion agrees with the result of the t-test.

    9.5 t-test for paired data (paired t-test, t.test)

    The t-test just described is for studies with two independent groups (such as men and women). It cannot be applied to studies in which one group of subjects is followed over time. I will provisionally call these paired studies. For such studies we need a t-test known as the paired t-test.

    Example 12. A group of 10 patients was treated with a drug intended to lower blood pressure. Each patient's blood pressure was measured at the start of the study (before treatment) and again after treatment. The blood pressure data for the 10 patients are:

    Before treatment (x0): 180, 140, 160, 160, 220, 185, 145, 160, 160, 170
    After treatment (x1):  170, 145, 145, 125, 205, 185, 150, 150, 145, 155

    The question is whether this change in blood pressure is enough to conclude that the drug is effective in lowering blood pressure. To answer it, we use the paired t-test:

    > # enter the data
    > before <- c(180, 140, 160, 160, 220, 185, 145, 160, 160, 170)
    > after <- c(170, 145, 145, 125, 205, 185, 150, 150, 145, 155)
    > bp <- data.frame(before, after)
    > # t-test
    > t.test(before, after, paired=TRUE)

            Paired t-test

    data:  before and after
    t = 2.7924, df = 9, p-value = 0.02097
    alternative hypothesis: true difference in means is not equal to 0
    95 percent confidence interval:
      1.993901 19.006099
    sample estimates:
    mean of the differences
                       10.5

    The result shows that blood pressure fell by 10.5 mmHg after treatment, with a 95% confidence interval of 2.0 mmHg to 19.0 mmHg and p = 0.021. We therefore have evidence that the reduction in blood pressure is statistically significant. Note that if we wrongly analyze these data with the test for two independent groups, as below, the p-value of 0.32 suggests that the reduction is not statistically significant!

    > t.test(before, after)

            Welch Two Sample t-test

    data:  before and after
    t = 1.0208, df = 17.998, p-value = 0.3209
    alternative hypothesis: true difference in means is not equal to 0
    95 percent confidence interval:
     -11.11065  32.11065
    sample estimates:
    mean of x mean of y
        168.0     157.5

    9.6 Wilcoxon test for paired data (wilcox.test)

    Instead of the paired t-test, we can also use the wilcox.test function for the same purpose:

    > wilcox.test(before, after, paired=TRUE)

            Wilcoxon signed rank test with continuity correction

    data:  before and after
    V = 42, p-value = 0.02291
    alternative hypothesis: true mu is not equal to 0

    This result confirms once more that the reduction in blood pressure is statistically significant, with a p-value (p = 0.023) that is very close to the one from the paired t-test.
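The reason the paired analysis is so much more sensitive than the independent-groups analysis is that it works on each patient's own change. As a minimal sketch using the blood pressure data from Example 12, the paired t-test is the same as a one-sample t-test on the within-patient differences (the helper names d, t_hand, paired and one_sample are introduced here for illustration):

```r
# Blood pressure data from Example 12
before <- c(180, 140, 160, 160, 220, 185, 145, 160, 160, 170)
after  <- c(170, 145, 145, 125, 205, 185, 150, 150, 145, 155)

# Within-patient change
d <- before - after

# t statistic by hand: mean difference divided by its standard error
t_hand <- mean(d) / (sd(d) / sqrt(length(d)))

# The paired test and a one-sample test on the differences agree
paired     <- t.test(before, after, paired = TRUE)
one_sample <- t.test(d, mu = 0)
c(by_hand    = t_hand,
  paired     = unname(paired$statistic),
  one_sample = unname(one_sample$statistic))
```

All three values match the t = 2.7924 reported above. Differencing removes the large variation between patients, which is what inflates the standard error when the two columns are treated as independent samples.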


    9.7 Frequencies

    The table function in R reports the frequencies of a categorical variable such as sex or ethnicity.

    > table(sex)
    sex
    Female   Male
        69     31

    > table(ethnicity)
    ethnicity
      African     Asian Caucasian    Others
            8        60        30         2

    A two-way table:

    > table(sex, ethnicity)
            ethnicity
    sex      African Asian Caucasian Others
      Female       4    43        22      0
      Male         4    17         8      2

    Note that table does not give percentages. To compute percentages we need the prop.table function, which can be used as follows:

    # create an object named freq to hold the frequencies
    > freq <- table(sex, ethnicity)
    > freq
            ethnicity
    sex      African Asian Caucasian Others
      Female       4    43        22      0
      Male         4    17         8      2

    # use margin.table to see the marginal totals
    > margin.table(freq, 1)
    sex
    Female   Male
        69     31
    > margin.table(freq, 2)
    ethnicity
      African     Asian Caucasian    Others
            8        60        30         2

    # compute percentages with prop.table
    > prop.table(freq, 1)
            ethnicity
    sex         African      Asian  Caucasian     Others
      Female 0.05797101 0.62318841 0.31884058 0.00000000
      Male   0.12903226 0.54838710 0.25806452 0.06451613

    In this table, prop.table gives the ethnic composition within each sex. Among women, for example, 5.8% are African, 62.3% are Asian and 31.9% are Caucasian, for a total of 100%. Similarly, among men 12.9% are African, 54.8% are Asian, and so on.

    # compute percentages with prop.table
    > prop.table(freq, 2)
            ethnicity
    sex        African     Asian Caucasian    Others
      Female 0.5000000 0.7166667 0.7333333 0.0000000
      Male   0.5000000 0.2833333 0.2666667 1.0000000

    In this table, prop.table gives the sex composition within each ethnic group. In the Asian group, for example, 71.7% are women and 28.3% are men.

    # compute percentages of the whole table
    > freq/sum(freq)
            ethnicity
    sex      African Asian Caucasian Others
      Female    0.04  0.43      0.22   0.00
      Male      0.04  0.17      0.08   0.02

    9.8 Test of a proportion (proportion test, prop.test, binom.test)

    A test of a single proportion is usually based on the binomial distribution. With sample size n and proportion p, and provided n is large (more than 50, say), the binomial distribution can be approximated by a normal distribution with mean np and variance np(1 - p). Let x be the number of events of interest. The hypothesis p = π can then be tested with the following statistic:

    $$ z = \frac{x - n\pi}{\sqrt{n\pi(1 - \pi)}} $$

    Here z follows a normal distribution with mean 0 and variance 1. Equivalently, $z^2$ follows a Chi-squared distribution with 1 degree of freedom.
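As a small worked sketch of this statistic, the code below applies it to the 69 women among 100 subjects shown in the frequency table in section 9.7, testing against a hypothesized proportion of 0.5. The variable names x, n and pi0 are introduced here for illustration.

```r
x   <- 69    # number of women
n   <- 100   # sample size
pi0 <- 0.5   # hypothesized proportion

# Normal-approximation statistic from the formula above
z <- (x - n * pi0) / sqrt(n * pi0 * (1 - pi0))

# prop.test() reports z^2 as X-squared when the continuity correction is off
chi_sq <- unname(prop.test(x, n, p = pi0, correct = FALSE)$statistic)
c(z = z, z_squared = z^2, prop.test = chi_sq)
```

By default prop.test applies a continuity correction, so its X-squared is slightly smaller than z squared; that is why the default call in Example 13 reports 13.69.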


    Example 13. In the study above there are 69 women and 31 men, so the proportion of women is 0.69 (69%). To test whether this proportion really differs from 0.5, we can use the function prop.test(x, n, π):

    > prop.test(69, 100, 0.50)

            1-sample proportions test with continuity correction

    data:  69 out of 100, null probability 0.5
    X-squared = 13.69, df = 1, p-value = 0.0002156
    alternative hypothesis: true p is not equal to 0.5
    95 percent confidence interval:
     0.5885509 0.7766330
    sample estimates:
       p
    0.69

    In this output, prop.test estimates the proportion of women as 0.69, with a 95% confidence interval of 0.589 to 0.777. The Chi-squared value is 13.69, with p = 0.0002. The proportion of women in this study is therefore higher than 50%. A more exact way to test a proportion is the binomial test binom.test(x, n, π):

    > binom.test(69, 100, 0.50)

            Exact binomial test

    data:  69 and 100
    number of successes = 69, number of trials = 100, p-value = 0.0001831
    alternative hypothesis: true probability of success is not equal to 0.5
    95 percent confidence interval:
     0.5896854 0.7787112
    sample estimates:
    probability of success
                      0.69

    Overall, the result of the binomial test hardly differs from that of the Chi-squared test. With p = 0.00018, we have even stronger evidence that the proportion of women in this study is truly higher than 50%.

    9.9 Comparing two proportions (prop.test, binom.test)

    The method for comparing two proportions follows directly from the theory for testing a single proportion described above. Take two samples with n1 and n2 subjects and x1 and x2 events, from which we can estimate the two proportions p1 and p2. Probability theory tells us that, if the two true proportions are equal, the difference between the two samples, d = p1 - p2, follows a normal distribution with mean 0 and variance:


    $$ V_d = \left( \frac{1}{n_1} + \frac{1}{n_2} \right) p (1 - p) $$

    where:

    $$ p = \frac{x_1 + x_2}{n_1 + n_2} $$

    Therefore $z = d / \sqrt{V_d}$ follows a normal distribution with mean 0 and variance 1. In other words, $z^2$ follows a Chi-squared distribution with 1 degree of freedom. We can therefore also use prop.test to compare two proportions.
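The formulas above can be sketched directly in R. The counts used here, 7 events out of 100 and 20 out of 110, are those of the fracture example that follows (Example 14); the intermediate names p_hat, v_d and chi_sq are introduced for illustration.

```r
x <- c(7, 20)      # events in each group
n <- c(100, 110)   # group sizes

p_hat <- x / n                 # the two observed proportions
d     <- p_hat[1] - p_hat[2]   # their difference
p     <- sum(x) / sum(n)       # pooled proportion

v_d <- (1 / n[1] + 1 / n[2]) * p * (1 - p)   # variance of d under the null
z   <- d / sqrt(v_d)

# With the continuity correction switched off, prop.test() returns z^2
chi_sq <- unname(prop.test(x, n, correct = FALSE)$statistic)
c(z = z, z_squared = z^2, prop.test = chi_sq)
```

The default call prop.test(x, n) applies a continuity correction, so the X-squared it prints is somewhat smaller than z squared.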

    Example 14. A study was carried out to compare the efficacy of a drug for preventing fractures. Patients were divided into two groups: group A, which received treatment, had 100 patients, and group B, which received no treatment, had 110 patients. After 12 months of follow-up, 7 patients in group A and 20 patients in group B had sustained a fracture. The question is whether the fracture rates in the two groups are equal (that is, whether the drug has no effect). To test whether the two proportions really differ, we can use prop.test as follows:

    > fracture <- c(7, 20)
    > total <- c(100, 110)
    > prop.test(fracture, total)

            2-sample test for equality of proportions with continuity correction

    data:  fracture out of total
    X-squared = 4.8901, df = 1, p-value = 0.02701
    alternative hypothesis: two.sided
    95 percent confidence interval:
     -0.20908963 -0.01454673
    sample estimates:
       prop 1    prop 2
    0.0700000 0.1818182

    The analysis shows that the fracture rate is 0.07 in group 1 and 0.18 in group 2. The 95% confidence interval for the difference between the two proportions runs from -0.209 to -0.015; that is, the fracture rate in group A is estimated to be between about 1.5 and 21 percentage points lower than in group B. With p = 0.027, we can say that the fracture rate in group A is indeed lower than in group B.

    9.10 Comparing several proportions (prop.test, chisq.test)

    prop.test can also be used to test several proportions at once. In the study above we have 4 ethnic groups, with the following frequencies by sex:

    > table(sex, ethnicity)
            ethnicity
    sex      African Asian Caucasian Others
      Female       4    43        22      0
      Male         4    17         8      2

    We want to know whether the proportion of women differs among the 4 ethnic groups. To answer this question we again use prop.test:

    > female <- c(4, 43, 22, 0)
    > total <- c(8, 60, 30, 2)
    > prop.test(female, total)

            4-sample test for equality of proportions without continuity correction

    data:  female out of total
    X-squared = 6.2646, df = 3, p-value = 0.09942
    alternative hypothesis: two.sided
    sample estimates:
       prop 1    prop 2    prop 3    prop 4
    0.5000000 0.7166667 0.7333333 0.0000000

    Warning message:
    Chi-squared approximation may be incorrect in: prop.test(female, total)

    Although the proportion of women appears to vary widely between groups (73% in group 3 (Caucasian), compared with 50% in group 1 (African) and 71.7% in the Asian group), the Chi-squared test indicates that these differences are not statistically significant, because p = 0.099.

    9.10.1 Chi-squared test (chisq.test)

    In fact, the Chi-squared test can also be computed with the chisq.test function:

    > chisq.test(sex, ethnicity)

            Pearson's Chi-squared test

    data:  sex and ethnicity
    X-squared = 6.2646, df = 3, p-value = 0.09942

    Warning message:
    Chi-squared approximation may be incorrect in: chisq.test(sex, ethnicity)

    This result is identical to the one from prop.test.
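The same test can also be run from the counts alone, without the raw data. The sketch below rebuilds the sex by ethnicity table as a matrix (the object names counts and res are introduced here for illustration) and inspects the expected counts under independence, which helps explain the warning message.

```r
# Counts copied from table(sex, ethnicity)
counts <- matrix(c(4, 43, 22, 0,
                   4, 17,  8, 2),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(sex = c("Female", "Male"),
                                 ethnicity = c("African", "Asian",
                                               "Caucasian", "Others")))

# suppressWarnings() hides the small-sample warning so the script runs quietly
res <- suppressWarnings(chisq.test(counts))
res$statistic   # Pearson X-squared
res$parameter   # degrees of freedom
res$expected    # expected counts under independence
```

Several expected counts fall below 5, notably in the Others column. That is the situation in which R warns that the Chi-squared approximation may be incorrect.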


    9.10.2 Fisher's exact test (fisher.test)

    In the Chi-squared test above, note the warning:

    Warning message:
    Chi-squared approximation may be incorrect in: prop.test(female, total)

    The reason is that group 4 contains no women, so its proportion is 0%. Moreover, this group has only 2 subjects. With so few subjects, the statistical estimates may not be reliable. Another method that can be applied to studies with low frequencies like this one is Fisher's test (also known as Fisher's exact test). Readers may wish to consult the theory behind Fisher's test to better understand the logic of the method; here we are concerned only with how to compute it in R. The command is simply:

    > fisher.test(sex, ethnicity)

            Fisher's Exact Test for Count Data

    data:  sex and ethnicity
    p-value = 0.1048
    alternative hypothesis: two.sided

    Note that the p-value from Fisher's test is 0.1048, that is, very close to