
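R Statistics

Big Data Analytics Training (2015-11)

R Statistics (I)

• Data and data analysis – the meaning of data – univariate/bivariate/multivariate data

• Descriptive statistics

• Simulation and probability

• Probability distributions and sampling – discrete distributions – continuous distributions – sampling and sampling distributions

• Estimation – confidence intervals – hypothesis testing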


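R Statistics (II)

• Regression analysis – SLR, MLR

• Analysis of variance (ANOVA)

• Categorical data analysis – χ² goodness-of-fit test – χ² test of independence

• Nonparametric statistics

• Time-series analysis

• Multivariate analysis – vectors and matrices – interdependence analysis (PCA, factor analysis, MDS) – dependence analysis (SEM, discriminant analysis, canonical correlation, …)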


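• Unit I: Overview

– 1 Introduction

– 2 Charts and Graphs

– 3 Descriptive Statistics

– 4 Probability

• Unit II: Distributions and Sampling

– 5 Discrete Distributions

– 6 Continuous Distributions

– 7 Sampling and Sampling Distributions

• Unit III: Parameter Estimation

– 8 Confidence Interval Estimation (Single Population)

– 9 Hypothesis Testing (Single Population)

– 10 Estimation (Two Populations)

– 11 Analysis of Variance and Design of Experiments

• Unit IV: Regression Analysis and Forecasting

– 12 Simple Regression Analysis and Correlation

– 13 Multiple Regression Analysis

– 14 Multiple Regression Models

– 15 Time-Series Forecasting

– 16 Categorical Data Analysis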


Overview


1.1 Basic Concepts

• population • sample • parameter = a descriptive measure of the population • statistic = a descriptive measure of a sample


1.2 Data Measurement

• (Levels of data measurement)

– Nominal Level

– Ordinal Level

– Interval Level

– Ratio Level

• Comparison


2 Charts and Graphs

• 2.1 Frequency Distributions – Class Midpoint – Relative Frequency – Cumulative Frequency

• 2.2 Quantitative Data Graphs – Histograms – Frequency Polygons – Ogives – Pie Charts – Stem-and-Leaf Plots – Pareto Charts

• 2.3 Graphical Depiction of Two-Variable: Numerical Data – Scatter Plots


3 Descriptive Statistics

• 3.1 Central Tendency: Ungrouped Data

– Mode, Median, Mean

– Percentiles, Quartiles

• 3.2 Variability: Ungrouped Data

– Range, IQR

– Z-score, Coefficient of Variation

• 3.3 Central Tendency and Variability: Grouped Data

– Mean, Mode

• 3.4 Measures of Shape

• 3.5 Measures of Association


• 3.4 Measures of Shape – Skewness

• Skewness & the Relationship of the Mean, Median, and Mode

• Coefficient of Skewness

– Kurtosis


• Box-and-Whisker Plots


4 Probability

• 4.1 Basic Concepts

• 4.2 Marginal, Union, and Joint Probability

• 4.3 Addition & Multiplication Laws

• 4.4 Conditional Probability and Bayes’ Rule


4.1 Basic Concepts

• Experiment, Event, Elementary Events • Sample Space • Unions and Intersections • Mutually Exclusive Events • Independent Events • Collectively Exhaustive Events

– MECE (Mutually Exclusive, Collectively Exhaustive): no overlaps and no omissions, so the events together cover the whole sample space

• Complementary Events • Counting the Possibilities

– mn Counting Rule – Sampling from a Population with Replacement – Combinations: Sampling from a Population Without Replacement

• $_{N}C_{n} = \dfrac{N!}{n!\,(N-n)!}$
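The counting rules listed above can be checked numerically. The following is a minimal Python sketch rather than part of the original slides; the function names and the example sizes (N = 6, n = 2) are illustrative assumptions.

```python
import math

def mn_rule(m: int, n: int) -> int:
    """mn counting rule: m outcomes for one operation, n for a second one."""
    return m * n

def samples_with_replacement(N: int, n: int) -> int:
    """Number of ordered samples of size n drawn with replacement from N items."""
    return N ** n

def combinations(N: int, n: int) -> int:
    """N C n = N! / (n! (N - n)!): unordered samples drawn without replacement."""
    return math.factorial(N) // (math.factorial(n) * math.factorial(N - n))

# Illustrative values (assumed, not taken from the slides).
print(mn_rule(6, 6))                   # outcomes when two dice are rolled
print(samples_with_replacement(6, 2))  # ordered pairs drawn with replacement
print(combinations(6, 2))              # unordered pairs drawn without replacement
```

The hand-written `combinations` mirrors the formula term by term; in practice `math.comb(N, n)` returns the same value.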


4.2 Marginal, Union, Joint, and Conditional Probability


4.3 Addition and Multiplication Laws

• Addition • General Law of Addition:

– P(X ∪ Y) = P(X) + P(Y) − P(X ∩ Y) » where X, Y are events and P(X ∩ Y) is the probability of the intersection of X and Y.

– Probability Matrices • Complement of a Union

– P(neither X nor Y) = 1 − P(X ∪ Y)

– Special Law of Addition • If X, Y are mutually exclusive, P(X ∪ Y) = P(X) + P(Y)

• Multiplication – General Law of Multiplication

– P(X ∩ Y) = P(X) · P(Y|X) = P(Y) · P(X|Y)

– Special Law of Multiplication – If X, Y are independent, P(X ∩ Y) = P(X) · P(Y)
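As a quick check of the addition and multiplication laws, the sketch below enumerates one roll of a fair die. The events X ("even") and Y ("greater than 3") are assumptions chosen for illustration, and exact fractions are used so the identities hold without rounding.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}  # one roll of a fair die
X = {2, 4, 6}                      # assumed event: the roll is even
Y = {4, 5, 6}                      # assumed event: the roll is greater than 3

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event & sample_space), len(sample_space))

# General law of addition: P(X ∪ Y) = P(X) + P(Y) − P(X ∩ Y)
p_union = prob(X) + prob(Y) - prob(X & Y)

# General law of multiplication: P(X ∩ Y) = P(X) · P(Y|X)
p_y_given_x = prob(X & Y) / prob(X)
p_joint = prob(X) * p_y_given_x

# The special law of multiplication applies only to independent events.
independent = prob(X & Y) == prob(X) * prob(Y)

print(p_union, p_joint, independent)
```

Here X and Y are not independent, so the special law of multiplication would give the wrong joint probability for this pair.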


4.4 Conditional Probability and Bayes’ Rule

• Law of Conditional Probability – P(X | Y) = P(X ∩ Y) / P(Y) = P(X) · P(Y|X) / P(Y)

• Independent Events – Test for independence: to determine whether X and Y are independent events, the following must hold:

• P(X | Y) = P(X) and P(Y | X) = P(Y)

• Bayes’ Rule

– $P(X_i \mid Y) = \dfrac{P(X_i)\,P(Y \mid X_i)}{P(X_1)\,P(Y \mid X_1) + P(X_2)\,P(Y \mid X_2) + \cdots + P(X_n)\,P(Y \mid X_n)}$

» Denominator: “total probability formula”

• Odds • The probability of rolling a 2 or a 3 with a die is 2/6 = 1/3, but the odds of rolling a 2 or a 3 are 2/4 = 1/2. – (Here 4 is the number of unfavorable outcomes: 1, 4, 5, or 6.)
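Bayes' rule and the odds calculation can be sketched in a few lines of Python. The two-source priors and likelihoods below are hypothetical numbers chosen for illustration; only the die example (probability 1/3, odds 1/2) comes from the slide.

```python
from fractions import Fraction

def bayes_posteriors(priors, likelihoods):
    """Return P(X_i | Y) for every i, given P(X_i) and P(Y | X_i)."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(joint)  # total probability formula: P(Y)
    return [j / total for j in joint]

def odds(p):
    """Odds in favor of an event with probability p: p / (1 - p)."""
    return p / (1 - p)

# Hypothetical example: two sources X1, X2 and an observed outcome Y.
priors = [Fraction(65, 100), Fraction(35, 100)]      # assumed P(X1), P(X2)
likelihoods = [Fraction(8, 100), Fraction(12, 100)]  # assumed P(Y|X1), P(Y|X2)
posteriors = bayes_posteriors(priors, likelihoods)
print(posteriors)

# Die example: P(2 or 3) = 2/6, so the odds are (1/3) / (2/3).
print(odds(Fraction(2, 6)))
```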


Unit II: Distributions and Sampling

• 5 Discrete Distributions

• 6 Continuous Distributions

• 7 Sampling and Sampling Distributions


5 Discrete Distributions

• 5.1 Describing a Discrete Distribution – Mean or Expected Value

• = long-run average of occurrences

– Variance and Standard Deviation of a Discrete Distribution

• 5.2 Binomial Distribution

• 5.3 Poisson Distribution

• 5.4 Hypergeometric Distribution – = the probability distribution that arises when sampling without replacement from a finite population
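The three discrete distributions in this chapter, and the mean and variance of a discrete distribution, can be sketched with the Python standard library. The parameter values below are assumptions for illustration; the formulas are the standard probability mass functions.

```python
import math

def binomial_pmf(x, n, p):
    """P(X = x) for n independent trials with success probability p."""
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

def poisson_pmf(x, lam):
    """P(X = x) when occurrences average lam per interval."""
    return lam**x * math.exp(-lam) / math.factorial(x)

def hypergeometric_pmf(x, N, A, n):
    """P(X = x) successes in a sample of n drawn WITHOUT replacement
    from a finite population of N items, A of which are successes."""
    return math.comb(A, x) * math.comb(N - A, n - x) / math.comb(N, n)

def mean_and_variance(pmf_table):
    """Expected value and variance from (value, probability) pairs."""
    mu = sum(x * p for x, p in pmf_table)
    var = sum((x - mu) ** 2 * p for x, p in pmf_table)
    return mu, var

# Assumed example: binomial with n = 10, p = 0.5.
table = [(x, binomial_pmf(x, 10, 0.5)) for x in range(11)]
mu, var = mean_and_variance(table)
print(mu, var)                          # compare with n*p and n*p*q
print(hypergeometric_pmf(1, 10, 4, 3))  # assumed N = 10, A = 4, n = 3
print(poisson_pmf(0, 2))                # assumed lam = 2
```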


6 Continuous Distributions

• 6.1 The Uniform Distribution

• 6.2 Normal Distribution

• 6.3 Using the Normal Curve to Approximate Binomial Distribution Problems

• 6.4 Exponential Distribution


6.1 The Uniform Distribution

• Determining Probabilities in a Uniform Distribution

• Using the Computer to Solve for Uniform Distribution Probabilities


6.2 Normal Distribution

• History of the Normal Distribution • = Gaussian distribution = normal curve of error: “the errors of repeated measurements are often normally distributed”

• Probability Density Function of the Normal Distribution

• Standardized Normal Distribution

• Solving Normal Curve Problems • Using the Computer to Solve for Normal Distribution Probabilities


6.3 Using the Normal Curve to Approximate Binomial Distribution Problems

• Empirical rule: approximately 99.7% of the values of a normal curve are within 3 standard deviations of the mean.

• Another rule of thumb: the approximation is good enough if both n·p > 5 and n·q > 5.

• Correcting for Continuity • Converting a discrete distribution into a continuous distribution.
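The rule of thumb and the continuity correction can be illustrated by comparing an exact binomial tail with its normal approximation. This is a sketch under assumed values (n = 60, p = 0.30, and the question P(X ≥ 25)); none of these numbers come from the slides.

```python
import math
from statistics import NormalDist

def binomial_upper_tail(x, n, p):
    """Exact P(X >= x) for a binomial(n, p) variable."""
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x, n + 1))

def normal_approx_upper_tail(x, n, p):
    """Approximate P(X >= x) with a normal curve.
    Continuity correction: the discrete value x becomes the boundary x - 0.5."""
    mu = n * p
    sigma = math.sqrt(n * p * (1 - p))
    return 1 - NormalDist(mu, sigma).cdf(x - 0.5)

n, p, x = 60, 0.30, 25                      # assumed example values
rule_ok = n * p > 5 and n * (1 - p) > 5     # rule of thumb from the slide
exact = binomial_upper_tail(x, n, p)
approx = normal_approx_upper_tail(x, n, p)
print(rule_ok, round(exact, 4), round(approx, 4))
```

Printing both values side by side shows how close the approximation is once the rule of thumb is satisfied.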


6.4 Exponential Distribution

• (Exponential Distribution) – = a continuous distribution that describes the times between random occurrences

– cf. Poisson distribution = a discrete distribution that describes random occurrences over some interval

• Probabilities of the Exponential Distribution • Using the Computer to Determine Exponential Distribution Probabilities
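The link between the exponential and Poisson distributions can be made concrete: the probability that the gap between occurrences is at least t equals the Poisson probability of zero occurrences in an interval of length t. The sketch below assumes a hypothetical rate of 1.2 occurrences per minute.

```python
import math

def exponential_upper_tail(x0, lam):
    """P(X >= x0) for an exponential distribution with rate lam."""
    return math.exp(-lam * x0)

def poisson_pmf(k, mean):
    """P(K = k) for a Poisson distribution with the given mean."""
    return mean**k * math.exp(-mean) / math.factorial(k)

lam = 1.2  # hypothetical rate: 1.2 occurrences per minute
t = 2.0    # interval length in minutes

p_gap = exponential_upper_tail(t, lam)  # time to next occurrence >= 2 minutes
p_no_arrivals = poisson_pmf(0, lam * t) # zero occurrences within 2 minutes
print(round(p_gap, 4), round(p_no_arrivals, 4))
```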


7 Sampling and Sampling Distributions

• 7.2 Sampling Distribution of x̄

• 7.3 Sampling Distribution of p̂


7.2 Sampling Distribution of x̄

• [Central Limit Theorem] • $\mu_{\bar{x}} = \mu$

• $\sigma_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}}$

• [z Formula for Sample Means]

• Sampling from a Finite Population
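The z formula for sample means is usually written z = (x̄ − μ) / (σ/√n), and for a finite population of size N the standard error is multiplied by √((N − n)/(N − 1)). The sketch below applies both forms; the numbers (μ = 85, σ = 9, n = 40, x̄ = 87, N = 500) are assumptions for illustration.

```python
import math
from statistics import NormalDist

def z_for_sample_mean(x_bar, mu, sigma, n, N=None):
    """z value of a sample mean; pass N to apply the finite correction factor."""
    se = sigma / math.sqrt(n)               # standard error of the mean
    if N is not None:
        se *= math.sqrt((N - n) / (N - 1))  # finite population correction
    return (x_bar - mu) / se

z = z_for_sample_mean(87, 85, 9, 40)
z_finite = z_for_sample_mean(87, 85, 9, 40, N=500)
p_upper = 1 - NormalDist().cdf(z)           # P(sample mean >= 87)
print(round(z, 3), round(z_finite, 3), round(p_upper, 4))
```

The finite correction shrinks the standard error, so the same sample mean yields a larger z value.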


7.3 Sampling Distribution of p̂

• (…) • measurable data → sample mean

• countable items → sample proportion

• [Sample Proportion]

• [z Formula for Sample Proportions, for n·p > 5 and n·q > 5]


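Unit III: Parameter Estimation


8 Confidence Interval Estimation (Single Population)

• 8.1 Confidence Interval Estimation Using the z Statistic (Single Population) (σ Known)

• 8.2 Confidence Interval Estimation Using the t Statistic (Single Population) (σ Unknown)

• 8.3 Estimating the Population Proportion

• 8.4 Estimating the Population Variance

• 8.5 Determining the Sample Size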


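8.1 Confidence Interval Estimation Using the z Statistic (Single Population) (σ Known)

• [100(1 − α)% Confidence Interval to Estimate μ: σ Known]

• Finite correction factor

• When the sample size is small • So far the examples have mostly used n ≥ 30 • The z formula applies when the sample size is large (by the central limit theorem), or when n < 30 but the population is normally distributed and σ is known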
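A 100(1 − α)% confidence interval for μ with σ known is commonly computed as x̄ ± z(α/2) · σ/√n. The following sketch assumes hypothetical values (x̄ = 510, σ = 46, n = 85) and uses `statistics.NormalDist` from the Python standard library for the z value.

```python
import math
from statistics import NormalDist

def z_confidence_interval(x_bar, sigma, n, confidence=0.95, N=None):
    """Confidence interval for mu when sigma is known.
    Pass N to apply the finite correction factor."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # z value for alpha/2
    se = sigma / math.sqrt(n)
    if N is not None:
        se *= math.sqrt((N - n) / (N - 1))
    margin = z * se
    return x_bar - margin, x_bar + margin

low, high = z_confidence_interval(510, 46, 85)  # assumed sample values
print(round(low, 2), round(high, 2))
```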


8.2 Confidence Interval Estimation Using the t Statistic (Single Population) (σ Unknown)

• t distribution • Use the t distribution when the population is normally distributed but the population standard deviation is unknown.

– Every sample size has a different t distribution. An assumption underlying the use of the t statistic is that the population is normally distributed.

– If the population is not normally distributed or its distribution is unknown, use nonparametric techniques.

• Robustness • Characteristics of the t Distribution • Confidence Intervals to Estimate the Population Mean Using the t Statistic


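8.3 Estimating the Population Proportion


8.4 Estimating the Population Variance

• (…) • Sample Variance

• Relationship between the population variance and the sample variance: the χ² distribution – The ratio of the sample variance (s²) multiplied by n − 1 to the population variance (σ²), $\chi^2 = (n-1)s^2/\sigma^2$, is approximately chi-square distributed (with n − 1 degrees of freedom) if the population from which the values are drawn is normally distributed.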


8.5 Determining the Sample Size

• Sample Size when Estimating μ

• When μ is being estimated, the size of the sample can be determined by using the z formula for sample means to solve for n.

• The difference between x̄ and μ is the error of estimation resulting from the sampling process. Let E = (x̄ − μ) = the error of estimation. Substituting E into the preceding formula yields

• Solving for n to determine the sample size:

• …

• Sample Size when Estimating p
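Solving the z formulas for n is commonly written as n = (z·σ/E)² when estimating μ and n = z²·p·q/E² when estimating p. The sketch below implements both under that assumption; the inputs (σ = 5, E = 1, p = 0.5, E = 0.03, 95% confidence) are hypothetical, and the result is rounded up because a sample size must be a whole number.

```python
import math
from statistics import NormalDist

def sample_size_for_mean(sigma, error, confidence=0.95):
    """n = (z * sigma / E)^2, rounded up."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z * sigma / error) ** 2)

def sample_size_for_proportion(p, error, confidence=0.95):
    """n = z^2 * p * q / E^2, rounded up; p = 0.5 gives the most conservative n."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil(z ** 2 * p * (1 - p) / error ** 2)

n_mean = sample_size_for_mean(sigma=5, error=1)           # assumed inputs
n_prop = sample_size_for_proportion(p=0.5, error=0.03)    # assumed inputs
print(n_mean, n_prop)
```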


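9 Hypothesis Testing (Single Population)

• 9.1 Overview

• 9.2 Testing Hypotheses About a Population Mean Using the z Statistic (σ Known)

• 9.3 Testing Hypotheses About a Population Mean Using the t Statistic (σ Unknown)

• 9.4 Testing Hypotheses About a Proportion

• 9.5 Testing Hypotheses About a Variance

• 9.6 Type II Errors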


9.1 Overview

• Types of Hypotheses

• Research Hypotheses

• Statistical Hypotheses

– H0 Ha

• Substantive Hypotheses

• Using HTAB System to Test Hypotheses


• Rejection and Nonrejection Regions

• Type I and Type II Errors


9.2 Testing Hypotheses About a Population Mean Using the z Statistic (σ Known)

• [z Test for a Single Mean]

• Testing the Mean with a Finite Population

• Using the p-Value to Test Hypotheses • p-value = observed significance level • Instead of fixing α in advance, the probability is computed under the assumption that H0 is true.

The p-value defines the smallest value of alpha for which the null hypothesis can be rejected.

• (Example) If the p-value of a test is .038, H0 cannot be rejected at α = .01 because .038 is the smallest value of alpha for which the null hypothesis can be rejected. However, H0 can be rejected for α = .05.

• “H0 can be rejected only when α is greater than the p-value.”
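The p-value decision rule can be sketched as a small z test. The function names and the sample values used in the calls are assumptions for illustration; only the .038 example and the two α levels come from the slide.

```python
import math
from statistics import NormalDist

def z_test_p_value(x_bar, mu0, sigma, n, tail="two"):
    """p-value of a z test for a single mean (sigma known).
    tail is "upper", "lower", or "two"."""
    z = (x_bar - mu0) / (sigma / math.sqrt(n))
    cdf = NormalDist().cdf
    if tail == "upper":
        return 1 - cdf(z)
    if tail == "lower":
        return cdf(z)
    return 2 * (1 - cdf(abs(z)))

def reject_h0(p_value, alpha):
    """H0 is rejected only when alpha is greater than the p-value."""
    return p_value < alpha

# Slide example: a p-value of .038 at two significance levels.
print(reject_h0(0.038, 0.05))  # True: reject H0 at alpha = .05
print(reject_h0(0.038, 0.01))  # False: cannot reject H0 at alpha = .01

# Assumed sample: x_bar = 104.9, mu0 = 100, sigma = 15, n = 36 (z is about 1.96).
print(round(z_test_p_value(104.9, 100, 15, 36), 3))
```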


• Using the Critical Value Method to Test Hypotheses – [Rejecting the H0 using p-values]

• Using the Computer to Test Hypotheses About a Population Mean Using the z Statistic


9.3 Testing Hypotheses About a Population Mean Using the t Statistic (σ Unknown)

• (…)

– [z Test of a Population Proportion]

• Using the Computer to Test Hypotheses About a Population Mean Using the t Test


9.4 Testing Hypotheses About a Proportion

• […] – Using p-value

– Using the critical value method

• Using the Computer to Test Hypotheses About a Population Proportion


9.5 Testing Hypotheses About a Variance

• Table χ² • Observed χ²

• The null hypothesis can also be tested by the critical value method. • Instead of the observed χ² value, substitute the critical χ² value for α and solve for s²; this yields the critical sample variance ($s_c^2$).


9.6 Type II Errors

• Some Observations About Type II Errors


• Operating Characteristics and Power Curves

• Effect of Increasing Sample Size on the Rejection Limits • Increased sample size not only affects the distance of the critical raw score value from the hypothesized value of the distribution, but also can result in reducing β for a given value of α.


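10 Estimation for Two Populations

• 10.1 Estimation and Hypothesis Testing for the Difference in Two Means Using the z Statistic (Population Variances Known)

• 10.2 Estimation and Hypothesis Testing for the Difference in Two Means: Independent Samples and Population Variances Unknown

• 10.3 Estimation for Two Related Populations

• 10.4 Estimation for Two Population Proportions (p1 − p2)

• 10.5 Estimation for Two Population Variances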


10.1 Estimation and Hypothesis Testing for the Difference in Two Means Using the z Statistic (Population Variances Known)

• (…) • The central limit theorem states that the difference in two sample means, x̄1 − x̄2, is approximately normally distributed for large sample sizes (both n1 and n2 ≥ 30) regardless of the shape of the populations. It can also be shown that $\mu_{\bar{x}_1-\bar{x}_2} = \mu_1 - \mu_2$ and $\sigma_{\bar{x}_1-\bar{x}_2} = \sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}$.

• z formula for the difference in two sample means

• Hypothesis Testing • H0: μ1 − μ2 = δ • Ha: μ1 − μ2 ≠ δ

• Confidence Intervals

• Using the Computer to Test Hypotheses About the Difference in Two Population Means Using the z Test


10.2 Estimation and Hypothesis Testing for the Difference in Two Means: Independent Samples and Population Variances Unknown

• Hypothesis Testing

• Using the Computer to Test Hypotheses and Construct Confidence Intervals About the Difference in Two Population Means Using the t Test



10.3 Estimation for Two Related Populations

• (Types) • Before-and-after study • Matched pairs with built-in relatedness, as an experimental control mechanism (e.g., twins, siblings)


• Using the Computer to Make Statistical Inferences About Two Related Populations



10.4 Estimation for Two Population Proportions (p1 − p2)

• (…)




10.5 Estimation for Two Population Variances

• (…)


11 Analysis of Variance and Design of Experiments

• 11.1 Design of Experiments

• 11.2 The Completely Randomized Design (One-Way ANOVA)

• 11.3 Multiple Comparison Tests

• 11.4 The Randomized Block Design

• 11.5 A Factorial Design (Two-Way ANOVA)


11.1 Design of Experiments

• Experimental design – = a plan and a structure to test hypotheses in which the researcher either controls or manipulates one or more variables.

• An independent variable (= factor) may be either a treatment variable or a classification variable. – A treatment variable is a variable the experimenter controls or modifies in the experiment. – A classification variable is some characteristic of the experimental subject that was present prior to the experiment and is not a result of the experimenter’s manipulations or control.

– Each independent variable has two or more levels, or classifications (= subcategories of the independent variable).


11.2 The Completely Randomized Design (One-Way ANOVA)

• One-Way Analysis of Variance • H0: μ1 = μ2 = μ3 = … = μk

• Ha: At least one of the means is different from the others.


• Reading the F Distribution Table

• ANOVA tests are always one-tailed tests • “Observed F value” vs. “critical value of the F test” (= table F value, looked up by the degrees of freedom) • Reject H0 if observed F > critical F

• Comparison of F and t Values – F = t² for df_C = 1
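The observed F value of a one-way ANOVA, and the F = t² relationship for two groups, can be computed by hand. The sketch below uses small made-up groups (assumptions, not data from the slides) and computes F = MSC / MSE, where MSC is the between-group mean square and MSE the within-group mean square; the two-group t value uses the pooled-variance formula.

```python
import math

def one_way_anova_f(groups):
    """Return (F, df_C, df_E) for a completely randomized design."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ssc = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    sse = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_c, df_e = k - 1, n_total - k
    return (ssc / df_c) / (sse / df_e), df_c, df_e

def pooled_t(a, b):
    """Two-sample t statistic with pooled variance (equal variances assumed)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    ss = sum((x - ma) ** 2 for x in a) + sum((x - mb) ** 2 for x in b)
    sp2 = ss / (na + nb - 2)
    return (ma - mb) / math.sqrt(sp2 * (1 / na + 1 / nb))

# Hypothetical data: three treatment levels, three observations each.
f3, df_c, df_e = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(f3, df_c, df_e)

# With two groups (df_C = 1), F equals t squared.
f2, _, _ = one_way_anova_f([[1, 2, 3], [2, 3, 4]])
t = pooled_t([1, 2, 3], [2, 3, 4])
print(f2, t ** 2)
```

The observed F is then compared with the table F value for (df_C, df_E) at the chosen α.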


11.3 Multiple Comparison Tests

• (…) • ANOVA is useful for testing hypotheses about differences among the means of multiple groups

– (Advantage) The Type I error rate, α, is controlled

• Tukey’s Honestly Significant Difference (HSD) Test: The Case of Equal Sample Sizes

• Tukey-Kramer Procedure: The Case of Unequal Sample Sizes

11.4 The Randomized Block Design

• (…) • Independent variable (= treatment variable) + blocking variable (to control a confounding/concomitant variable)

• Similar to the CRD, but also includes a blocking variable that can be used to control for confounding or concomitant variables.


• RBD vs. CRD: both use F = MSC / MSE.

– RBD: MSE is computed after the block variation (MSR) has been separated out, so MSE is smaller and, as a result, the F value increases.

– CRD: the block variation (MSR) remains inside MSE, so MSE is larger and, as a result, the F value decreases.

11.5 A Factorial Design (Two-Way ANOVA)

• Advantages of the Factorial Design

• Comparison of CRD, RBD, and Factorial Design

– CRD: the effect of each variable is analyzed separately (one variable per design); that is, variables are studied in isolation.

– RBD and factorial design: both variables are analyzed at the same time in one design, and a confounding or concomitant variable can be controlled within a single study. Because the additional effects of the second variable are removed from the error sum of squares (SSE), there is potential for increased power over the completely randomized design.

– RBD: focuses on one treatment variable and controls for the blocking effect.

– Factorial design: a factorial design with two treatments is similar to an RBD, but it focuses on the effects of both variables. The interaction between the two treatment variables can be analyzed if multiple measurements are taken under every combination of levels of the two treatments.


• Factorial Designs with Two Treatments

• Applications

• Statistically Testing the Factorial Design • Row effects: H0: Row means are all equal. Ha: At least one row mean is different.

• Column effects: H0: Column means are all equal. Ha: At least one column mean is different.

• Interaction effects: H0: Interaction effects = 0. Ha: An interaction effect is present.

• Each of these observed F values is compared to a table F value.

• The table F value is determined by α, df_num, and df_denom.


• Interaction

• Using a Computer to Do a Two-Way ANOVA


Unit IV: REGRESSION ANALYSIS AND FORECASTING

• 12 Simple Regression Analysis and Correlation

• 13 Multiple Regression Analysis

• 14 Building Multiple Regression Models

• 15 Time-Series Forecasting and Index Numbers


12 Simple Regression Analysis and Correlation

• 12.1 Correlation

• 12.2 Introduction to Simple Regression Analysis

• 12.3 Equation of the Regression Line

• 12.4 Residual Analysis

• 12.5 Standard Error of the Estimate

• 12.6 Coefficient of Determination

• 12.7 Hypothesis Tests for the Slope of the Regression Model and for the Overall Model

• 12.8 Estimation

12.1 Correlation


12.2 Introduction to Simple Regression Analysis


12.3 Equation of the Regression Line

• A deterministic regression model is y = β0 + β1x

• The probabilistic regression model is y = β0 + β1x + ε

• Equation of SLR line:
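A fitted simple regression line is commonly written ŷ = b0 + b1·x, with the least-squares estimates b1 = SSxy / SSxx and b0 = ȳ − b1·x̄. The sketch below computes these under that assumption; the five data points are made up for illustration, and the names `b0`, `b1`, `ss_xy`, and `ss_xx` are not taken from the slides.

```python
def fit_simple_regression(x, y):
    """Least-squares intercept b0 and slope b1 for y_hat = b0 + b1 * x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    ss_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    ss_xx = sum((xi - x_mean) ** 2 for xi in x)
    b1 = ss_xy / ss_xx
    b0 = y_mean - b1 * x_mean
    return b0, b1

# Hypothetical data points.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0, b1 = fit_simple_regression(x, y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]  # y - y_hat
print(round(b0, 4), round(b1, 4))
print([round(r, 4) for r in residuals])
```

The residuals of a least-squares fit with an intercept sum to zero, which is a quick sanity check on the arithmetic.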


12.4 Residual Analysis

• (…)


• Using Residuals to Test the Assumptions of the Regression Model

• Using the Computer for Residual Analysis


12.5 Standard Error of the Estimate

• To analyze the error of the model, use the standard error of the estimate instead of computing the residuals (= the estimation errors at individual points).


12.6 Coefficient of Determination

• (…) • r² = 0 means the predictor accounts for none of the variability of the dependent variable and that there is no regression prediction of y by x.

• r² = 1 means perfect prediction of y by x and that 100% of the variability of y is accounted for by x.

• Relationship Between r and r² • r² = (r)²


12.7 Hypothesis Tests for the Slope of the Regression Model and for the Overall Model

• Testing the Slope • The hypotheses for this test:

– H0: β1 = 0, Ha: β1 ≠ 0

• This test is two-tailed. One-tailed alternatives are also possible: – H0: β1 = 0, Ha: β1 > 0, or H0: β1 = 0, Ha: β1 < 0

• In each case, testing H0 involves a t test of the slope.


• Testing the Overall Model • F test to determine the overall significance of the model.

– In multiple regression, this test determines whether at least one of the regression coefficients (from multiple predictors) is different from zero.

– Simple regression provides only one predictor and only one regression coefficient to test.

• Because regression coefficient is the slope of the regression line, F test for overall significance is testing the same thing as the t test in simple regression.

– H0: β1 = 0

– Ha: β1 ≠ 0

• In the case of simple regression analysis, F = t². Thus, for the airline cost example, the F value is

– F = t² = (9.43)² = 88.92

• The F value is computed directly by
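One common way to compute the F value directly is F = MSreg / MSerr, where MSreg = SSR / 1 and MSerr = SSE / (n − 2) in simple regression. The sketch below checks that this equals t² for the slope. The data are made up for illustration; only the 9.43 figure comes from the slide's airline cost example.

```python
import math

# Hypothetical data points (not the airline cost data).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

x_mean, y_mean = sum(x) / n, sum(y) / n
ss_xx = sum((xi - x_mean) ** 2 for xi in x)
ss_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
ss_yy = sum((yi - y_mean) ** 2 for yi in y)

b1 = ss_xy / ss_xx                  # slope
b0 = y_mean - b1 * x_mean           # intercept
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
ssr = ss_yy - sse                   # regression sum of squares

se = math.sqrt(sse / (n - 2))       # standard error of the estimate
sb = se / math.sqrt(ss_xx)          # standard error of the slope
t = b1 / sb                         # t statistic for H0: beta1 = 0

f = (ssr / 1) / (sse / (n - 2))     # F = MSreg / MSerr
r2 = 1 - sse / ss_yy                # coefficient of determination

print(round(t, 4), round(t ** 2, 4), round(f, 4), round(r2, 4))
print(round(9.43 ** 2, 2))          # slide example: F = t^2
```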


12.8 Estimation

• Confidence Intervals to Estimate the Conditional Mean of y: μy|x

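• Substituting a value of x into the estimated regression equation gives a point estimate, but that estimate depends on the particular sample set of points used to fit the line. Hence computing a confidence interval for the estimation is often useful.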

• Prediction Intervals to Estimate a Single Value of y


13 Multiple Regression Analysis

• 13.1 Multiple Regression Model

• 13.2 Significance Tests and the Regression Model

• 13.3 Residuals, Standard Error of the Estimate, and R²


13.1 Multiple Regression Model

• (…) • Simple regression model: y =β0 + β1x +ε

• Multiple regression model: y =β0 + β1x1 + β2x2 + …+ βkxk +ε

• Multiple Regression Model with Two Independent Variables (First Order)

• Determining the Multiple Regression Equation

• A Multiple Regression Model


13.2 Significance Tests and the Regression Model

• (…) • testing the overall significance of the model,

• studying the significance tests of the regression coefficients,

• computing the residuals,

• examining the standard error of the estimate,

• observing the coefficient of determination.

• Testing the Overall Model • H0: β1 = β2 = … = βk = 0

• Ha: At least one of the regression coefficients is ≠ 0

• Significance Tests of the Regression Coefficients


13.3 Residuals, Standard Error of the Estimate, and R²

• Residuals • Residual = y − ŷ

• SSE and Standard Error of the Estimate • Standard error of the estimate = the dispersion of the points in the scatter plot around the best-fit line; it indicates how widely the actual y values are spread around ŷ

• SSE = Σ(y − ŷ)²

• Coefficient of Multiple Determination (R²)

• Adjusted R²


Unit V NONPARAMETRIC STATISTICS

• 14 Analysis of Categorical Data

• 15 Nonparametric Statistics (omitted)


14 Analysis of Categorical Data

• 14.1 Chi-Square Goodness-of-Fit Test

• 14.2 Contingency Analysis: Chi-Square Test of Independence


14.1 Chi-Square Goodness-of-Fit Test

• (…)

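• Shape of the χ² distribution: the smaller the df, the more it is skewed to the right; the larger the df, the more closely it resembles the normal distribution.

• Degrees of freedom to use (k = number of categories):

– Uniform distribution assumed, or an expected distribution given: k − 1

– Testing whether the observations follow a Poisson distribution: k − 2 (λ is estimated)

– Testing whether the observations follow a normal distribution: k − 3 (μ and σ are estimated)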

• Testing a Population Proportion by Using the Chi-Square Goodness-of-Fit Test as an Alternative Technique to the z Test
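The goodness-of-fit statistic is commonly computed as χ² = Σ (fo − fe)² / fe over the k categories, with degrees of freedom k − 1 minus the number of parameters estimated from the data. The sketch below assumes that form; the observed counts are hypothetical.

```python
def chi_square_gof(observed, expected, estimated_params=0):
    """Return (chi-square statistic, degrees of freedom).
    estimated_params: 0 for a uniform or given expected distribution,
    1 for Poisson (lambda estimated), 2 for normal (mu and sigma estimated)."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    chi2 = sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))
    df = len(observed) - 1 - estimated_params
    return chi2, df

# Hypothetical counts tested against a uniform expected distribution.
observed = [18, 22, 20, 20]
expected = [20, 20, 20, 20]
chi2, df = chi_square_gof(observed, expected)
print(round(chi2, 4), df)
```

The observed χ² is then compared with the table χ² value for the computed degrees of freedom.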


14.2 Contingency Analysis: Chi-Square Test of Independence

• (…) • χ² goodness-of-fit test: analyzes the distribution of frequencies for categories of one variable, such as age or number of bank arrivals, to determine whether the distribution of these frequencies is the same as some hypothesized or expected distribution. However, the goodness-of-fit test cannot be used to analyze two variables simultaneously.

• The χ² test of independence analyzes the frequencies of two variables with multiple categories to determine whether the two variables are independent. (Example applications: …)

• (Example) On a questionnaire: In which region of the country do you reside? A. Northeast B. Midwest C. South D. West. Which type of financial investment are you most likely to make today? E. Stocks F. Bonds G. Treasury Bills

• The business researcher would tally the frequencies of responses to these two questions into a two-way table called a contingency table. Because the chi-square test of independence uses a contingency table, this test is sometimes referred to as contingency analysis.
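A contingency analysis of this kind can be sketched as follows. Expected frequencies are computed as (row total × column total) / N, and the degrees of freedom are (rows − 1)(columns − 1). The 4 × 3 table of region by investment type holds made-up counts for illustration only.

```python
def chi_square_independence(table):
    """Return (chi-square statistic, degrees of freedom) for a contingency table
    given as a list of rows of observed frequencies."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, fo in enumerate(row):
            fe = row_totals[i] * col_totals[j] / n  # expected frequency
            chi2 += (fo - fe) ** 2 / fe
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df

# Hypothetical tallies: rows = Northeast, Midwest, South, West;
# columns = Stocks, Bonds, Treasury Bills.
survey = [
    [40, 25, 15],
    [30, 30, 20],
    [35, 20, 25],
    [25, 25, 30],
]
chi2, df = chi_square_independence(survey)
print(round(chi2, 3), df)
```

If the observed χ² exceeds the table χ² value for these degrees of freedom, the hypothesis that region and investment type are independent is rejected.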

