Statistical Methods for Genetic Research
Ho Kim
Graduate School of Public Health, Seoul National University
2007/01/13
Human and Chimp
• How Similar ?
• Very Similar ! (99.999% ?)
Contents
• Introduction and general theory (Statistical Genetics)
• Linkage Analysis
• SNP association
• Sample Size Problem (Two-stage Design)
• Introduction to SAS Genetics and implementation examples
Genotype
Allele frequency
Genotype frequency
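Hardy-Weinberg
In a stable population with random mating, allele frequencies predict genotype frequencies.
Goodness-of-fit can be applied to test H-W equilibrium.
• Chi-square test. H0: our data follow the specified model (HWE)
Genotype probabilities under HWE (allele frequencies p and q):
        p       q
  p     p^2     pq
  q     qp      q^2
Basic statistical theory
df = (number of genotypes) − (number of parameters to estimate) − 1
In general: df = number of categories − 1 − number of estimated parameters
$\sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}} \sim \chi^2_{df}$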
• HWE Example 1
- p = (2×298 + 489)/(2×1000) = 0.5425
  q = (489 + 2×213)/(2×1000) = 0.4575
- P(AA) = p^2 = (0.5425)^2 = 0.2943
  P(Aa) = 2pq = 2×(0.5425)×(0.4575) = 0.4964
  P(aa) = q^2 = (0.4575)^2 = 0.2093
- Expected counts (expected frequency × 1000)
  AA = P(AA)×1000 = 294.3063
  Aa = P(Aa)×1000 = 496.3875
  aa = P(aa)×1000 = 209.3063
             Count                   Frequency
             Observed   Expected     Observed   Expected
  AA         298        294.3063     0.2980     0.2943
  Aa         489        496.3875     0.4890     0.4964
  aa         213        209.3063     0.2130     0.2093
  total      1000       1000         1          1
• Test statistic
$\chi^2 = \sum \frac{(O-E)^2}{E} = \frac{(298-294.3063)^2}{294.3063} + \frac{(489-496.3875)^2}{496.3875} + \frac{(213-209.3063)^2}{209.3063} = 0.2215$
df = 3 − 1 − 1 = 1
∴ Based on the chi-square distribution with 1 degree of freedom, the p-value is 0.6379, so the observed data do not provide sufficient evidence to reject H0 (HWE). We therefore conclude that the population is in HWE.
In practice, this test is widely used as a check for genotyping errors.
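The HWE example can be checked numerically. The sketch below is a minimal illustration rather than the SAS procedure used later in these slides; the function name `hwe_chi_square` is hypothetical, and the p-value uses the fact that the upper tail of a chi-square distribution with 1 degree of freedom equals erfc(sqrt(x/2)).

```python
import math

def hwe_chi_square(n_AA, n_Aa, n_aa):
    """Goodness-of-fit test for HWE at a biallelic locus (df = 3 - 1 - 1 = 1)."""
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)      # estimated frequency of allele A
    q = 1 - p
    observed = (n_AA, n_Aa, n_aa)
    expected = (n * p * p, n * 2 * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return p, chi2, p_value

p_hat, chi2, p_value = hwe_chi_square(298, 489, 213)
print(f"p = {p_hat:.4f}, chi-square = {chi2:.4f}, p-value = {p_value:.4f}")
```

With the counts from the example (298, 489, 213) this should reproduce the values on the slide up to rounding.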
• Test of association (Odds ratio, Chi-square test)
              1       2       Total
  Case        n11     n12     n1+
  Control     n21     n22     n2+
  Total       n+1     n+2     N
$OR = \frac{p_1/(1-p_1)}{p_2/(1-p_2)} = \frac{(n_{11}/n_{1+})/(n_{12}/n_{1+})}{(n_{21}/n_{2+})/(n_{22}/n_{2+})} = \frac{n_{11}\,n_{22}}{n_{21}\,n_{12}}$
Chi-square test with df = (#col − 1)(#row − 1):
H0: OR = 1
Expected cell frequencies should be larger than 5; if not, use Fisher’s exact test.
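As a small illustration of the 2×2 test of association, the sketch below computes the odds ratio and the Pearson chi-square statistic (df = 1). The counts are made up for illustration and the function name is hypothetical; the smallest expected cell count is returned so that the "larger than 5" rule can be checked.

```python
import math

def odds_ratio_and_chi2(n11, n12, n21, n22):
    """Odds ratio and Pearson chi-square (df = 1) for a 2x2 case-control table."""
    n1p, n2p = n11 + n12, n21 + n22          # row totals (case, control)
    np1, np2 = n11 + n21, n12 + n22          # column totals
    N = n1p + n2p
    odds_ratio = (n11 * n22) / (n21 * n12)
    observed = [n11, n12, n21, n22]
    expected = [n1p * np1 / N, n1p * np2 / N, n2p * np1 / N, n2p * np2 / N]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_value = math.erfc(math.sqrt(chi2 / 2))  # upper tail of chi-square, 1 df
    return odds_ratio, chi2, p_value, min(expected)

# Hypothetical counts: 30/70 among cases, 15/85 among controls
or_hat, chi2_2x2, p_2x2, min_expected = odods = odds_ratio_and_chi2(30, 70, 15, 85)
print(or_hat, chi2_2x2, p_2x2, min_expected)
```

If `min_expected` falls below 5, the slide's advice is to switch to Fisher's exact test instead.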
• Chi-square Example 1: Genotype-based
              MM      Mm      mm      Total
  Case        n2A     n1A     n0A     n+A
  Control     n2O     n1O     n0O     n+O
  Total       n2+     n1+     n0+     N
$OR_{MM/mm} = \frac{n_{2A}\,n_{0O}}{n_{2O}\,n_{0A}}$
$OR_{Mm/mm} = \frac{n_{1A}\,n_{0O}}{n_{1O}\,n_{0A}}$
Linkage Disequilibrium
Alleles at different sites should occur in combinations at frequencies determined by their SNP allele frequencies; linkage disequilibrium is a departure from this expectation.
LD Block
Shaw et al. Am J of Medical Genet 114 205-213 (2002)
From SNP to Haplotype
  No.   DNA sequence        Haplotype   Phenotype
  1     GATATTCGTACGGA-T    AG-         Black eye
  2     GATGTTCGTACTGAAT    GTA         Brown eye
  3     GATATTCGTACGGA-T    AG-         Black eye
  4     GATATTCGTACGGAAT    AGA         Blue eye
  5     GATGTTCGTACTGAAT    GTA         Brown eye
  6     GATGTTCGTACTGAAT    GTA         Brown eye
Haplotypes (built from the SNP sites)
  AG-   2/6
  GTA   3/6
  AGA   1/6
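The step from aligned sequences to haplotype counts can be sketched in a few lines. This is only an illustration on the six sequences above: it treats every alignment column that varies as a SNP site (an assumption that holds for this toy example) and counts the resulting haplotypes.

```python
from collections import Counter

sequences = [
    "GATATTCGTACGGA-T",
    "GATGTTCGTACTGAAT",
    "GATATTCGTACGGA-T",
    "GATATTCGTACGGAAT",
    "GATGTTCGTACTGAAT",
    "GATGTTCGTACTGAAT",
]

# Columns (0-based) where the aligned sequences differ are treated as SNP sites
polymorphic = [i for i in range(len(sequences[0]))
               if len({s[i] for s in sequences}) > 1]

# A haplotype is the string of alleles at the polymorphic sites
haplotypes = ["".join(s[i] for i in polymorphic) for s in sequences]
counts = Counter(haplotypes)

for hap, n in counts.items():
    print(f"{hap}  {n}/{len(sequences)}")
```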
In-silico Haplotyping: Approaches
1) Clark’s algorithm
2) E-M algorithm (expectation-maximization algorithm)
3) Bayesian algorithm
Clark’s Algorithm
1) Find individuals that are homozygous at all loci or heterozygous at a single locus
SNP1 T T
SNP2 A A
SNP3 C C
T-A-C
T-A-C
SNP1 T T
SNP2 A A
SNP3 C G
T-A-C
T-A-G
Unambiguously defined
Clark’s Algorithm
2) Try to solve ambiguous haplotypes as a combination of solved ones
SNP1 A T
SNP2 A A
SNP3 C G
T-A-C : solved one
A-A-G
Continue until either all haplotypes have been solved or no more haplotypes can be found in this way.
Clark’s Algorithm: problems
• No homozygotes or single-SNP heterozygotes -> chain might never get started
• Many unsolved haplotypes left at the end
• Quite useful in practice !!
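The two steps of Clark's algorithm can be sketched as follows. This is a simplified illustration, not a reference implementation: genotypes are assumed to be given as one unordered allele pair per SNP, the function names are hypothetical, and ambiguous genotypes are resolved with the first compatible known haplotype.

```python
def complement(genotype, hap):
    """Haplotype that pairs with `hap` to give `genotype`, or None if incompatible."""
    other = []
    for (a, b), h in zip(genotype, hap):
        if h == a:
            other.append(b)
        elif h == b:
            other.append(a)
        else:
            return None
    return tuple(other)

def clark(genotypes):
    """Return {individual index: (haplotype1, haplotype2)} for resolved individuals."""
    known, resolved = [], {}
    # Step 1: homozygotes and single-site heterozygotes are unambiguous
    for i, g in enumerate(genotypes):
        if sum(a != b for a, b in g) <= 1:
            pair = (tuple(a for a, _ in g), tuple(b for _, b in g))
            resolved[i] = pair
            for h in pair:
                if h not in known:
                    known.append(h)
    # Step 2: explain ambiguous genotypes as known haplotype + its complement
    progress = True
    while progress:
        progress = False
        for i, g in enumerate(genotypes):
            if i in resolved:
                continue
            for h in known:
                c = complement(g, h)
                if c is not None:
                    resolved[i] = (h, c)
                    if c not in known:
                        known.append(c)
                    progress = True
                    break
    return resolved

# The three genotypes from the slides (SNP1, SNP2, SNP3)
genotypes = [
    (("T", "T"), ("A", "A"), ("C", "C")),
    (("T", "T"), ("A", "A"), ("C", "G")),
    (("A", "T"), ("A", "A"), ("C", "G")),
]
resolved = clark(genotypes)
print(resolved[2])
```

Individuals that never appear in the returned dictionary are the "unsolved haplotypes left at the end" mentioned above.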
EM Algorithm
• Use multinomial likelihood with HWE
Pr(AT//AA//CG)
= Pr(AAC/TAG) + Pr(AAG/TAC)
= Pr(AAC)Pr(TAG) + Pr(AAG)Pr(TAC)
Fallin and Schork (2000) showed that EM is better than Clark’s algorithm
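A minimal sketch of the EM idea for haplotype frequencies is given below. It is illustrative only: function names are hypothetical, genotypes are assumed to be unordered allele pairs per SNP, all compatible haplotype pairs are enumerated (feasible only for a few loci), and heterozygous pairs carry the factor 2 implied by HWE.

```python
from itertools import product
from collections import defaultdict

def phase_pairs(genotype):
    """All distinct unordered haplotype pairs compatible with an unphased genotype."""
    pairs = set()
    options = [(0, 1) if a != b else (0,) for a, b in genotype]
    for choice in product(*options):
        h1 = tuple(g[c] for g, c in zip(genotype, choice))
        h2 = tuple(g[1 - c] for g, c in zip(genotype, choice))
        pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)

def em_haplotype_freq(genotypes, n_iter=200, tol=1e-10):
    expansions = [phase_pairs(g) for g in genotypes]
    haps = sorted({h for pairs in expansions for pair in pairs for h in pair})
    freq = {h: 1.0 / len(haps) for h in haps}          # uniform starting values
    for _ in range(n_iter):
        counts = defaultdict(float)
        for pairs in expansions:
            # E-step: posterior weight of each compatible pair under HWE
            w = [(1 if h1 == h2 else 2) * freq[h1] * freq[h2] for h1, h2 in pairs]
            total = sum(w)
            for (h1, h2), wi in zip(pairs, w):
                counts[h1] += wi / total
                counts[h2] += wi / total
        # M-step: expected haplotype counts divided by the 2n chromosomes
        new = {h: counts[h] / (2 * len(genotypes)) for h in haps}
        delta = max(abs(new[h] - freq[h]) for h in haps)
        freq = new
        if delta < tol:
            break
    return freq

genotypes = [
    (("T", "T"), ("A", "A"), ("C", "C")),
    (("T", "T"), ("A", "A"), ("C", "G")),
    (("A", "T"), ("A", "A"), ("C", "G")),   # the ambiguous AT//AA//CG genotype
]
freq = em_haplotype_freq(genotypes)
for hap, f in freq.items():
    print("-".join(hap), round(f, 4))
```

On this toy data the ambiguous individual is pulled toward the pair AAG/TAC, because TAC is already common among the unambiguous individuals.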
A Gibbs sampler, Stephens et al (2001)
• G = (G1, …, Gn): observed multilocus genotypes
  H = (H1, …, Hn): unknown haplotype pairs
  F = (F1, …, FM): M unknown population haplotype frequencies
1. Choose individual i from all ambiguous individuals
2. Sample $H_i^{(t+1)}$ from $\Pr(H_i \mid G, H_{-i}^{(t)})$
3. Set $H_j^{(t+1)} = H_j^{(t)}$ for j = 1, 2, …, i−1, i+1, …, n
Environmental Effects
Assume $E_i \sim NID(0, \sigma_E^2)$
- Normal: bell shaped
- Independent (not correlated)
- Distribution: mean = 0, homogeneous variance
Assume the phenotype (Y) is the combined effect of the genotype (G) and the environment (E)
$Y_i = G_i + E_i$
Putative gene (locus)
Gene ? Phenotype
Linkage analysis (LD, sib pair, et al.)
Association study
New gene discovery
Segregation
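Heritability
  Trait                                  Heritability (%)   Trait                   Heritability (%)
  Lifespan                               29                 Verbal ability          63
  Height                                 85                 Maximum heart rate      84
  Body weight                            63                 Arithmetic ability      76
  Amino acid secretion                   72                 Memory                  47
  Blood lipid concentration              44                 Social adaptability     66
  Maximum blood lactate concentration    34                 Emotional sensitivity   58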
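Model
$Y_{ij} = \mu + X_i + E_{ij}$
$X_i \sim N(0, \sigma_B^2)$
$E_{ij} \sim N(0, \sigma_W^2)$
i = 1, 2, …, I indexes the sibship; j = 1, 2, …, F_i indexes the members of the sibship; $X_i$ is the deviation of the family from the overall mean $\mu$; $E_{ij}$ is the deviation of member j from the family mean.
Heritability = $\frac{\sigma_B^2}{\sigma_B^2 + \sigma_W^2}$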
Variance Component Model
$P = A + D + I + E + G \times E$
A + D + I: Genetic; E + G×E: Nuisance(?)
A: Additive  D: Dominance  I: Epistatic
E: Environmental (Common + Independent)
Narrow sense Heritability = A / Total
Broad sense Heritability = Genetic / Total
How to identify the genes
• Family study
– Linkage analysis: requires pedigrees
– Sib pair analysis: oligogenic, multigenic
• Population study
– Case-control association study
$T = \frac{\left\{\sum_{i=1}^{k}(D_{1i} - E_{1i})\right\}^2}{\sum_{i=1}^{k} V_{1i}}$
Number of genetic determinants by effects
≠
$t_1 < t_2 < \dots < t_k$
Linkage analysis: Key Concepts
[Figure: pedigree diagrams of father, mother and child illustrating recombination; transmission to each child (child 1, child 2, …) is stochastic]
• PARAMETRIC LINKAGE ANALYSIS
To estimate the recombination fraction between markers and a hypothesized trait locus, where inheritance parameters of the trait locus (mode of inheritance, penetrance, phenocopy rate, allele frequencies, etc.) must be specified.
Ex. Lod score method
• LOD SCORE
The common logarithm of the likelihood ratio:
Z(θ) = log10 [L(θ ) / L(½)]
where θ is the recombination fraction between two loci
• Purpose Of The Lod Score Method
1. Estimation of the recombination fraction, θ
2. Hypothesis testing
H0: θ = ½ (absence of linkage)
H1: θ < ½ (linkage)
$Z_{max} = Z(\hat{\theta}) = \log_{10}[L(\hat{\theta}) / L(1/2)]$
• Scale For Testing Linkage
Zmax ≥ 3 : Strong linkage
Zmax > 0 : Support linkage
Zmax < 0 : Against linkage
Zmax = 0 : No support
(not related to recombination in linkage or no linkage)
Phase known pedigree
Figure 2 Phase known pedigree
• The maximum likelihood estimator of $\theta$ is 2/6 = 1/3
$Z(\theta) = \log_{10}\frac{\theta^2(1-\theta)^4}{0.5^2 \cdot 0.5^4} = \log_{10}\left[2^6\,\theta^2(1-\theta)^4\right]$
$Z(1/3) \approx 0.1476$
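The phase-known lod score is simple enough to verify directly. The sketch below assumes, as in the example, that recombinants and non-recombinants can be counted (2 and 4 here); the function name `lod` is hypothetical.

```python
import math

def lod(theta, recombinants, nonrecombinants):
    """Z(theta) = log10[L(theta) / L(1/2)] for a phase-known pedigree (0 < theta < 1)."""
    n = recombinants + nonrecombinants
    return (recombinants * math.log10(theta)
            + nonrecombinants * math.log10(1 - theta)
            - n * math.log10(0.5))

theta_hat = 2 / 6                 # MLE: proportion of recombinants
z_max = lod(theta_hat, 2, 4)
print(round(z_max, 4))
```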
Phase-unknown pedigree
Figure 3 Phase-unknown pedigree
• The maximum likelihood estimator of $\theta$ is not so trivial
$L(\theta) = \frac{1}{2}\theta^4(1-\theta)^2 + \frac{1}{2}\theta^2(1-\theta)^4 = \frac{1}{2}\theta^2(1-\theta)^2\left[\theta^2 + (1-\theta)^2\right]$
• The MLE is found to be 0.5 by a numerical method
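One simple numerical method for the phase-unknown case is a grid search over 0 ≤ θ ≤ 0.5. The sketch below is only one possible approach (the slides do not say which method was used); the grid resolution of 0.001 is an arbitrary choice.

```python
import math

def likelihood(theta):
    """Phase-unknown likelihood: average over the two possible parental phases."""
    return 0.5 * theta**4 * (1 - theta)**2 + 0.5 * theta**2 * (1 - theta)**4

grid = [i / 1000 for i in range(0, 501)]          # theta from 0 to 0.5
theta_hat = max(grid, key=likelihood)
z_max = math.log10(likelihood(theta_hat) / likelihood(0.5))
print(theta_hat, z_max)
```

The maximum sits at the boundary θ = 0.5, so the maximum lod score is 0 and this pedigree gives no evidence for linkage.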
Genotype Unknown-Phenotype known
Figure 4 Genotype Unknown-Phenotype known
$L(\theta; \text{data}) = \Pr(Ph_{ma})\,\Pr(Ph_{pa}) \times \prod_{\text{offspring}} \Pr(Ph_{offs} \mid Ph_{ma}, Ph_{pa}; \theta)$
and we know that
$\Pr(Ph_{ma}) = \sum_{G} \Pr(G)\,\Pr(Ph_{ma} \mid G)$
• NONPARAMETRIC LINKAGE ANALYSIS
Inheritance parameters of the trait locus are not specified. Rather, one focuses on pairs (or multiples) of affected individuals and investigates marker allele sharing among these individuals, contrasting observed allele sharing with that expected when the marker has nothing to do with the trait.
Ex. IBD (identical by descent) test
IBS (Identical by state), IBD (Identical by descent)
Human Genome Epidemiology, Khoury, Little, Burke
AN EXAMPLE FAMILY WITH DISEASE LOCUS AT THE MARKER
               Marker genotype   Trait genotype
  Parents      3 4               + –
               3 2               + –
  Offspring    3 3               + +
               3 4               + –
               2 3               – +
               2 4               – –
• Only ‘+ +’ individuals are “affected” (‘+’ is recessive to ‘–’)
** Qualitative Trait
  Sib-Pair Markers     Disease Status    # of Shared
  sib1     sib2        d1     d2         i.b.d.         C
  3 | 3    3 | 3       +      +          2              1
  3 | 3    3 | 4       +      -          1              0.25
  3 | 3    2 | 3       +      -          1              0.25
  3 | 3    2 | 4       +      -          0              0.25
  3 | 4    3 | 4       -      -          2              0.5
  3 | 4    2 | 3       -      -          0              0.5
  3 | 4    2 | 4       -      -          1              0.5
  2 | 3    2 | 3       -      -          2              0.5
  2 | 3    2 | 4       -      -          1              0.5
  2 | 4    2 | 4       -      -          2              0.5
• $C_j = (d_{j1} - \mu)(d_{j2} - \mu) = \alpha + \beta\,IBD_j + \varepsilon_j$
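The i.b.d. column of the sib-pair table can be reproduced with a short sketch. It assumes that in "a | b" the first allele comes from the 3/2 parent and the second from the 3/4 parent, so two sibs share an allele i.b.d. from a parent exactly when the corresponding entries match; the function name is hypothetical.

```python
from itertools import combinations_with_replacement

def ibd_shared(sib1, sib2):
    """Number of alleles shared i.b.d., given parent-of-origin ordered genotypes."""
    a1, b1 = [x.strip() for x in sib1.split("|")]
    a2, b2 = [x.strip() for x in sib2.split("|")]
    return int(a1 == a2) + int(b1 == b2)

offspring = ["3 | 3", "3 | 4", "2 | 3", "2 | 4"]
pairs = list(combinations_with_replacement(offspring, 2))
ibd = [ibd_shared(s1, s2) for s1, s2 in pairs]

for (s1, s2), k in zip(pairs, ibd):
    print(s1, " ", s2, " ", k)
```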
• Linkage And LD
- The two loci can be assumed to reside on different chromosomes.
The presence of LD does not necessarily imply linkage between the loci considered.
- Although LD originally referred to an association of alleles at different loci, it has become customary to take LD to mean association among alleles due to close linkage. “allelic association”
• Genomewide Linkage Analysis of Bipolar Disorder by Use of a High-Density Single-Nucleotide Polymorphism (SNP) Genotyping Assay: A Comparison with Microsatellite Marker Assays and Finding of Significant Linkage to Chromosome 6q22
• F. A. Middleton, M. T. Pato, K. L. Gentile, C. P. Morley, X. Zhao, A. F. Eisener, A. Brown, T. L. Petryshen, A. N. Kirby, H. Medeiros, C. Carvalho, A. Macedo, A. Dourado, I. Coelho, J. Valente, M. J. Soares, C. P. Ferreira, M. Lei, M. H. Azevedo, J. L. Kennedy, M. J. Daly, P. Sklar, and C. N. Pato
• Am. J. Hum. Genet., 74:000, 2004
We performed a linkage analysis on 25 extended multiplex Portuguese families segregating for bipolar disorder, by use of a high-density single-nucleotide polymorphism (SNP) genotyping assay, the GeneChip Human Mapping 10K Array (HMA10K). Of these families, 12 were used for a direct comparison of the HMA10K with the traditional 10-cM microsatellite marker set and the more dense 4-cM marker set. This comparative analysis indicated the presence of significant linkage peaks in the SNP assay in chromosomal regions characterized by poor coverage and low information content on the microsatellite assays. The HMA10K provided consistently high information and enhanced coverage throughout these regions. Across the entire genome, the HMA10K had an average information content of 0.842 with 0.21-Mb intermarker spacing. In the 12-family set, the HMA10K-based analysis detected two chromosomal regions with genomewide significant linkage on chromosomes 6q22 and 11p11; both regions had failed to meet this strict threshold with the microsatellite assays. The full 25-family collection further strengthened the findings on chromosome 6q22, achieving genomewide significance with a maximum nonparametric linkage (NPL) score of 4.20 and a maximum LOD score of 3.56 at position 125.8 Mb. In addition to this highly significant finding, several other regions of suggestive linkage have also been identified in the 25-family data set, including two regions on chromosome 2 (57 Mb, NPL = 2.98; 145 Mb, NPL = 3.09), as well as regions on chromosomes 4 (91 Mb, NPL = 2.97), 16 (20 Mb, NPL = 2.89), and 20 (60 Mb, NPL = 2.99). We conclude that at least some of the linkage peaks we have identified may have been largely undetected in previous whole-genome scans for bipolar disorder because of insufficient coverage or information content, particularly on chromosomes 6q22 and 11p11.
• Figure 2 Linkage signals obtained with 10-cM spaced and 4-cM spaced microsatellite assays, as well as the HMA10K SNP genotyping assay. These assays were performed on the same individuals from each of the same 12 families. Note the high correlation of the different assays in general, and that for both chromosomes 6 and 11, the SNP assay detected major linkage peaks at locations where the information content and coverage of the microsatellite panels were relatively low. Mb, megabase position; MSM, microsatellite markers.
• Figure 3 NPL analysis of 25 families with bipolar disorder from the Portuguese Island Collection. The number of each chromosome is shown at the top of each plot. The X-axis indicates the physical position (Mb) of the SNP marker. The Y-axis indicates the NPL Z score (black) or Kong and Cox LOD score (gray). For this scan, the empirical limit for genomewide significance was an NPL score of 3.85 and a LOD score of 3.15. Note that only the peak on chromosome 6 at 125.8 Mb was significant when both NPL Z and LOD thresholds were used.
• Figure 4 Comparison of the 12-family (gray) and 25-family (black) genomewide linkage scans for selected chromosomes showing suggestive or significant linkage (see table 1). The X-axis indicates physical position (Mb). Note that for both scans, the signal on chromosome 6 at position 125.8 Mb is the only genomic region that achieves genomewide significance (of NPL score and/or LOD score).
• Another Approach To LD Analysis (“Family-Based Study”)
1. Haplotype relative risk (HRR) method: Falk and Rubinstein (1987)
2. Haplotype-based haplotype relative risk (HHRR) method: Terwilliger and Ott (1992)
3. Transmission/disequilibrium test (TDT): Spielman et al. (1993)
4. Sib-transmission/disequilibrium test (S-TDT): Spielman and Ewens (1998)
• Transmission/disequilibrium test
Example: both parents have genotype 1 2 and the affected child has genotype 1 1.
                         Not transmitted
  Transmitted            Allele 1 (A1)    Allele 2 (A2)
  Allele 1 (A1)          0 (a)            2 (b)
  Allele 2 (A2)          0 (c)            0 (d)
- Focus on heterozygous parents only, and allow the use of multiple affected siblings.
- McNemar’s test (standard χ2 test), H0: b = c
  The TDT statistic: $\chi^2_1 = \frac{(b-c)^2}{b+c}$
- Powerful only in the presence of LD.
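The TDT statistic only needs the two off-diagonal counts. The sketch below uses made-up counts (b = 60, c = 40) purely for illustration; the function name is hypothetical and the p-value again uses the 1-df chi-square tail.

```python
import math

def tdt(b, c):
    """McNemar-type TDT statistic from heterozygous-parent transmissions.

    b: A1 transmitted, A2 not transmitted; c: A2 transmitted, A1 not transmitted.
    """
    chi2 = (b - c) ** 2 / (b + c)
    p_value = math.erfc(math.sqrt(chi2 / 2))   # upper tail of chi-square, 1 df
    return chi2, p_value

chi2_tdt, p_tdt = tdt(60, 40)   # hypothetical counts
print(chi2_tdt, round(p_tdt, 4))
```

With only two informative transmissions, as in the small pedigree above (b = 2, c = 0), the statistic is 2.0 and far too small a sample for the chi-square approximation to be trusted.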
SNP Association Study
1. Study design
   1. Select target disease
   2. Case-control criteria
   3. Determine # of samples
2. Sample and Data Collection
   1. Genetic materials
   2. Clinical information/phenotypic classification
   3. Environmental information
3. Genotyping
   1. Select candidate genes/SNP
   2. Whole genome screening
   3. Select appropriate method of genotyping
4. Statistical Analysis
• Statistical analysis scheme of SNP Genotyping Data
Multiple Comparisons
ex) Suppose a single test is carried out at significance level $\alpha$.
Let $H_{01}: \theta_1 = 0$, Pr(do not reject $H_{01}$ | $H_{01}$ is true) = $1 - \alpha$
    $H_{02}: \theta_2 = 0$, Pr(do not reject $H_{02}$ | $H_{02}$ is true) = $1 - \alpha$
then, where $H_0$ = $H_{01}$ and $H_{02}$, for independent tests
Pr(do not reject $H_0$ | $H_0$) = Pr(do not reject $H_{01}$ and do not reject $H_{02}$ | $H_0$) = $(1-\alpha)(1-\alpha) = (1-\alpha)^2$
In general, with $\alpha_1 = \alpha_2 = \alpha_3 = \dots = \alpha_k = \alpha$:
∴ $(1-\alpha)^k \le (1-\alpha)$
If k = 4 multiple comparisons are made at $\alpha = 0.05$:
$1 - 0.1855 = 0.8145 = (.95)^4 \le .95$
∴ The overall $\alpha$ is not 0.05 but 0.1855, so the type I error is inflated.
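Multiple Comparisons: Bonferroni Correction
Bonferroni Correction: if m multiple comparisons are made, setting the significance level of each test to $\alpha/m$ keeps the overall significance level close to $\alpha$.
Example) m = 4:
$\left(1 - \frac{0.05}{4}\right)^4 \cong 0.95 = 1 - 0.05$
Application) Comparing 10 means:
With a p-value cutoff of 0.05 for each test the overall level cannot be maintained, so each test is carried out with the cutoff $\frac{0.05}{10} = 0.005$.
This is called the “Bonferroni corrected p-value”.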
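The arithmetic behind the inflation of the type I error and the Bonferroni correction can be checked in a few lines. The sketch assumes independent tests, as the product rule above does; the function name is hypothetical.

```python
def familywise_error(alpha, k):
    """Probability of at least one false rejection among k independent tests."""
    return 1 - (1 - alpha) ** k

uncorrected = familywise_error(0.05, 4)        # each test at 0.05
bonferroni = familywise_error(0.05 / 4, 4)     # each test at 0.05 / 4

print(round(uncorrected, 4))   # inflated overall level
print(round(bonferroni, 4))    # close to, and below, 0.05
print(0.05 / 10)               # per-test cutoff for 10 comparisons
```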
Multiple Comparisons: FDR
FDR = False Positive / Total Positive
1. Order p-values (largest to smallest)
2. Test against 0.05 k/N, k = N, N−1, …, 1
• The error rate is reduced sequentially, so power is reduced much less
• Bonferroni is too conservative; FDR is helpful
Multiple Comparisons: FDR (independent test)
False Discovery Rate
1. Order the p-values: P(1) ≤ P(2) ≤ … ≤ P(m)
2. Find the largest k such that P(k) ≤ kα/m
3. Tests 1, 2, …, k are declared significant.
(Example) m = 500K
0.05/500K = 10^(-7): Bonferroni correction
Compare the ordered p-values with 0.05/500K × 2, 0.05/500K × 3, … in turn;
if P(2000) < 2000 × 10^(-7) and P(2001) > 2001 × 10^(-7), then 2000 tests are selected.
Multiple Comparisons: FDR (dependent test)
Benjamini and Hochberg (1995); Benjamini and Yekutieli (2001)
1. Order the p-values: P(1) ≤ P(2) ≤ … ≤ P(m)
2. Find the largest k such that
   - if tests are independent or positively correlated: $P_{(k)} \le \frac{k}{m}\alpha$
   - if tests are negatively correlated: $P_{(k)} \le \frac{k}{m} \cdot \frac{\alpha}{\sum_{i=1}^{m} 1/i}$
3. Tests 1, 2, …, k are declared significant.
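The step-up rule can be sketched as below. The function name and the example p-values are made up for illustration; `dependent=True` applies the Benjamini and Yekutieli divisor (the sum of 1/i), otherwise the Benjamini and Hochberg threshold kα/m is used.

```python
def step_up(pvalues, alpha=0.05, dependent=False):
    """Indices of tests declared significant by the BH (or BY) step-up procedure."""
    m = len(pvalues)
    # Correction factor: 1 for BH, harmonic sum for BY
    c = sum(1 / i for i in range(1, m + 1)) if dependent else 1.0
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank * alpha / (m * c):
            k = rank                      # keep the largest rank that passes
    return sorted(order[:k])

# Hypothetical p-values from m = 10 tests
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
bh = step_up(pvals)
by = step_up(pvals, dependent=True)
bonf = [i for i, p in enumerate(pvals) if p <= 0.05 / len(pvals)]
print(bh, by, bonf)
```

In this toy example BH declares two tests significant, while both Bonferroni and the more conservative BY threshold keep only the smallest p-value.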
Permutation test
Five steps to a permutation test (Good 1994)
1. Analyze the problem: identify the hypothesis and the alternative(s) of interest.
2. Choose a statistic.
3. Compute the test statistic for the original observations.
4. Generate the null reference distribution by
   - rearranging the labels in a manner consistent with the randomization procedure
   - computing the test statistic
   - repeating these two steps until you obtain the distribution of the test statistic for all possible rearrangements.
5. Accept or reject the hypothesis using this permutation distribution as a guide.
• The test statistic can be based on:
  a) the actual observations (Fisher-Pitman)
  b) their ranks
A numerical Example
• Gene expression data for two groups
• Test statistic = t statistic (comparing two means)
               Group 1          Group 2           T
  Raw data     1, 2, 3, 4       0, -1, -2, -3     4.38
  P1           1, 0, -2, -1     2, 3, 4, -3       -1.19
  …            …                …                 …
One-sided p-value
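The numerical example is small enough to enumerate all rearrangements. The sketch below uses the pooled-variance t statistic (an assumption that matches the T values shown in the table) and computes the one-sided p-value as the share of the 70 possible relabelings whose statistic is at least as large as the observed one.

```python
import math
from itertools import combinations
from statistics import mean, variance

def t_statistic(x, y):
    """Two-sample t statistic with pooled variance."""
    nx, ny = len(x), len(y)
    pooled = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / math.sqrt(pooled * (1 / nx + 1 / ny))

group1 = [1, 2, 3, 4]
group2 = [0, -1, -2, -3]
data = group1 + group2
t_obs = t_statistic(group1, group2)

count = total = 0
for idx in combinations(range(len(data)), len(group1)):
    x = [data[i] for i in idx]
    y = [data[i] for i in range(len(data)) if i not in idx]
    total += 1
    if t_statistic(x, y) >= t_obs - 1e-12:   # small tolerance for float ties
        count += 1

p_value = count / total
print(round(t_obs, 2), count, total, round(p_value, 4))
```

Because the observed split is the most extreme of all rearrangements, only the original labeling reaches the observed statistic.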
Statistical Models for SNP Association Study
  Response Var                     Group                       Statistical Methods
  Binary (case-control)            2 groups                    Chi-square test
                                   2 groups (N<5 per group)    Fisher’s Exact test
                                   Adjustment covariates       Logistic regression
  Continuous (BMI, BP, etc)        2 groups                    T-test
                                   2 groups (N<5 per group)    Wilcoxon test
                                   3 groups or more            ANOVA
                                   Adjustment covariates       ANCOVA, regression
Logistic regression for SNP Association Study
• Diagnostic accuracy; problem of selecting the control group
• SNP, haplotype, haplotype pair
• Genetic model (e.g. additive or dominant)
• Beware of small cell sizes
• Uncertainty in haplotype estimation
Two-Stage Genotyping Designs for Genome-Wide Association Scans
• Optimal Two-stage Genotyping Design for Genome-Wide Association Scans, Wang et al. Genetic Epidemiology (2006)
• All SNPs are genotyped in the first stage in a fraction of samples.
• A liberal significance level threshold is used to identify a subset of SNPs with putative associations.
• In later stages, these putative associations are re-tested in a separate sample.
Cost Function
Expected genotype cost of the overall study : t1n1m + t2n2m2
Expected m2 = [(m-T)α1+ T(1-β1)] t1 and t2 be the per-genotype costm, m2 : number of markersT : number of true causal SNP
The goal is to find the minimum expected cost and the corresponding parameter n1, c1, n2, c2 (thus α1, β1, α2, β2), subject to the two constraints (1) and (2).
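The cost function itself is easy to evaluate for a given design. The sketch below only computes the expected cost; it does not perform the constrained optimization, and all input values (marker count, sample sizes, α1, β1, unit costs) are hypothetical.

```python
def expected_cost(m, T, n1, n2, alpha1, beta1, t1=1.0, t2=1.0):
    """Expected number of stage-2 markers and expected total genotyping cost."""
    m2 = (m - T) * alpha1 + T * (1 - beta1)   # false positives + true positives carried forward
    cost = t1 * n1 * m + t2 * n2 * m2
    return m2, cost

# Hypothetical design: 500,000 SNPs, 10 causal, liberal stage-1 threshold
m2, cost = expected_cost(m=500_000, T=10, n1=1000, n2=2000, alpha1=0.01, beta1=0.2)
one_stage = 1.0 * (1000 + 2000) * 500_000     # genotyping every SNP in everyone
print(m2, cost, cost / one_stage)
```

In this made-up setting the two-stage design costs roughly a third of genotyping all markers in all samples.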
Public DB
SNP DB (NCBI): http://www.ncbi.nlm.nih.gov/projects/SNP/
http://geneticassociationdb.nih.gov/
HapMap Project: http://www.hapmap.org
KSNP database: http://www.ngri.go.kr/SNP
Introduction to SAS Genetics and implementation example
/* 5 markers for 25 individuals (two allele columns per marker) */
data markers;
   input (a1-a10) ($);
   datalines;
B B A B B B A A B B
A A B B A B A B C C
B B A A B B B B A C
A B A B A B A B A B
A A A B A B B B C C
B B A A A B A B C C
A B B B A B A A A B
A B A A A A A A A A
B B A A A A A B B B
A B A B A B B B A C
A A A B A A A B B C
B B A B A B A B A C
A B B B A A A B A C
B B B B A A A A A B
A B A A A B A A C C
A B A A A B A B C C
B B A A A A A B A A
A A A B A A A B A B
A B A A A A B B C C
A A A A A A A A B B
A B B B A A A A C C
A B A B A B A A B B
B B A B A B A A A C
A B A A A B A B A C
A B B B B B A B B B
;
/* data= and outstat=: input and output datasets                 */
/* perms=: permutations for the HW exact test                    */
/* boot=:  bootstrap samples used for the HWD coefficient limits */
proc allele data=markers outstat=ld prefix=Marker
            perms=10000 boot=1000 seed=123;
   var a1-a10;
run;
proc print data=ld;
run;
Number Number
of of Hetero- Allelic
Locus Indiv Alleles PIC zygosity Diversity
Marker1 25 2 0.3714 0.4800 0.4928
Marker2 25 2 0.3685 0.3600 0.4872
Marker3 25 2 0.3546 0.4800 0.4608
Marker4 25 2 0.3648 0.4800 0.4800
Marker5 25 3 0.5817 0.4400 0.6552
Marker Summary
--------------Test for HWE--------------
Chi- Pr > Prob
Locus Square DF ChiSq Exact
Marker1 0.0169 1 0.8967 1.0000
Marker2 1.7041 1 0.1918 0.2262
Marker3 0.0434 1 0.8350 1.0000
Marker4 0.0000 1 1.0000 1.0000
Marker5 9.3537 3 0.0249 0.0106
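PIC (polymorphism information content): the probability that the allele transmitted from a parent to an offspring can be identified
Heterozygosity: the proportion of heterozygous individuals
Allelic diversity: the proportion of heterozygous individuals expected under HWE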
Allele Frequencies
Standard 95% Confidence
Locus Allele Frequency Error Limits
Marker1 A 0.4400 0.0711 0.3000 0.5800
Marker1 B 0.5600 0.0711 0.4200 0.7000
Marker2 A 0.5800 0.0784 0.4200 0.7400
Marker2 B 0.4200 0.0784 0.2600 0.5800
Marker3 A 0.6400 0.0665 0.5200 0.7600
Marker3 B 0.3600 0.0665 0.2400 0.4800
Marker4 A 0.6000 0.0693 0.4600 0.7400
Marker4 B 0.4000 0.0693 0.2600 0.5400
Marker5 A 0.2800 0.0637 0.1400 0.4200
Marker5 B 0.3000 0.0800 0.1600 0.4600
Marker5 C 0.4200 0.0833 0.2800 0.6000
Genotype Frequencies
HWD Standard 95% Confidence
Locus Genotype Frequency Coeff Error Limits
Marker1 A/A 0.2000 0.0064 0.0493 -0.0916 0.0956
Marker1 A/B 0.4800 0.0064 0.0493 -0.0916 0.0956
Marker1 B/B 0.3200 0.0064 0.0493 -0.0916 0.0956
Marker2 A/A 0.4000 0.0636 0.0477 -0.0336 0.1484
Marker2 A/B 0.3600 0.0636 0.0477 -0.0336 0.1484
Marker2 B/B 0.2400 0.0636 0.0477 -0.0336 0.1484
Marker3 A/A 0.4000 -0.0096 0.0457 -0.1044 0.0800
Marker3 A/B 0.4800 -0.0096 0.0457 -0.1044 0.0800
Marker3 B/B 0.1200 -0.0096 0.0457 -0.1044 0.0800
Marker4 A/A 0.3600 0.0000 0.0480 -0.0916 0.0864
Marker4 A/B 0.4800 0.0000 0.0480 -0.0916 0.0864
Marker4 B/B 0.1600 0.0000 0.0480 -0.0916 0.0864
Marker5 A/A 0.0800 0.0016 0.0405 -0.0756 0.0816
Marker5 A/B 0.1600 0.0040 0.0337 -0.0664 0.0636
Marker5 A/C 0.2400 -0.0024 0.0380 -0.0736 0.0680
Marker5 B/B 0.2000 0.1100 0.0445 0.0144 0.1884
Marker5 B/C 0.0400 0.1060 0.0282 0.0440 0.1564
Marker5 C/C 0.2800 0.1036 0.0453 0.0096 0.1884
data markers;
input (g1-g5) ($);
datalines;
B/B A/B B/B A/A B/B
A/A B/B A/B A/B C/C
B/B A/A B/B B/B A/C
A/B A/B A/B A/B A/B
A/A A/B A/B B/B C/C
B/B A/A A/B A/B C/C
A/B B/B A/B A/A A/B
A/B A/A A/A A/A A/A
B/B A/A A/A A/B B/B
……
proc allele data=markers outstat=ld prefix=Marker
perms=10000 boot=1000 seed=123
genocol delimiter='/';
var g1-g5;
run;
data snps;
input s1-s10;
datalines;
2 2 2 1 2 1 1 1 2 2
2 2 2 2 2 1 1 1 2 2
2 2 2 2 2 1 2 1 2 2
2 2 2 2 . . 1 1 2 2
2 2 2 2 1 2 1 2 2 2
2 2 2 2 . . 2 1 2 2
2 2 2 2 2 1 2 1 2 2
2 2 2 2 . . 2 1 2 2
2 2 2 2 1 1 1 1 2 2
2 2 1 1 2 2 2 1 2 2
2 2 2 1 2 2 2 1 2 2
2 2 2 2 1 1 1 1 2 2
2 2 2 1 2 2 2 2 2 2
2 2 2 2 2 2 1 1 2 2
2 2 2 2 2 1 2 1 2 2
2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 1 2 2
proc allele data=snps prefix=SNP
nofreq haplo=est corrcoeff dprime yulesq;
var s1-s10;
run;
The ALLELE Procedure
Marker Summary
Number Number
of of Hetero- Allelic
Locus Indiv Alleles PIC zygosity Diversity
SNP1 44 1 0.0000 0.0000 0.0000
SNP2 44 2 0.1190 0.0909 0.1271
SNP3 41 2 0.3283 0.4390 0.4140
SNP4 43 2 0.3728 0.4884 0.4957
SNP5 44 1 0.0000 0.0000 0.0000
Marker Summary
---------Test for HWE---------
Chi- Pr >
Locus Square DF ChiSq
SNP1 0.0000 0 .
SNP2 3.5627 1 0.0591
SNP3 0.1493 1 0.6992
SNP4 0.0093 1 0.9231
SNP5 0.0000 0 .
Linkage Disequilibrium Measures
LD Corr
Locus1 Locus2 Haplotype Frequency Coeff Coeff
SNP1 SNP2 2-1 0.0682 -0.0000 .
SNP1 SNP2 2-2 0.9318 -0.0000 .
SNP1 SNP3 2-1 0.2927 -0.0000 .
SNP1 SNP3 2-2 0.7073 -0.0000 .
SNP1 SNP4 2-1 0.5465 -0.0000 .
SNP1 SNP4 2-2 0.4535 -0.0000 .
SNP1 SNP5 2-2 1.0000 0.0000 .
SNP2 SNP3 1-2 0.0732 0.0214 0.1807
SNP2 SNP3 2-1 0.2927 0.0214 0.1807
SNP2 SNP3 2-2 0.6341 -0.0214 -0.1807
SNP2 SNP4 1-1 0.0331 -0.0050 -0.0398
SNP2 SNP4 1-2 0.0367 0.0050 0.0398
SNP2 SNP4 2-1 0.5134 0.0050 0.0398
Lewontin's Yule's
Locus1 Locus2 D' Q
SNP1 SNP2 . .
SNP1 SNP2 . .
SNP1 SNP3 . .
SNP1 SNP3 . .
SNP1 SNP4 . .
SNP1 SNP4 . .
SNP1 SNP5 . .
SNP2 SNP3 1.0000 1.0000
SNP2 SNP3 1.0000 1.0000
SNP2 SNP3 -1.0000 -1.0000
SNP2 SNP4 -0.1322 -0.1546
SNP2 SNP4 0.1322 0.1546
SNP2 SNP4 0.1322 0.1546
Linkage Disequilibrium Measures
LD Corr
Locus1 Locus2 Haplotype Frequency Coeff Coeff
SNP2 SNP4 2-2 0.4168 -0.0050 -0.0398
SNP2 SNP5 1-2 0.0682 0.0000 .
SNP2 SNP5 2-2 0.9318 0.0000 .
SNP3 SNP4 1-1 0.2221 0.0608 0.2661
SNP3 SNP4 1-2 0.0779 -0.0608 -0.2661
SNP3 SNP4 2-1 0.3154 -0.0608 -0.2661
SNP3 SNP4 2-2 0.3846 0.0608 0.2661
SNP3 SNP5 1-2 0.2927 0.0000 .
SNP3 SNP5 2-2 0.7073 0.0000 .
SNP4 SNP5 1-2 0.5465 0.0000 .
SNP4 SNP5 2-2 0.4535 0.0000 .
Linkage Disequilibrium Measures
Lewontin's Yule's
Locus1 Locus2 D' Q
SNP2 SNP4 -0.1322 -0.1546
SNP2 SNP5 . .
SNP2 SNP5 . .
SNP3 SNP4 0.4382 0.5529
SNP3 SNP4 -0.4382 -0.5529
SNP3 SNP4 -0.4382 -0.5529
SNP3 SNP4 0.4382 0.5529
SNP3 SNP5 . .
SNP3 SNP5 . .
SNP4 SNP5 . .
SNP4 SNP5 . .
data markers;
input (m1-m8) ($);
datalines;
B B A B B B A A
A A B B A B A B
B B A A B B B B
A B A B A B A B
A A A B A B B B
B B A A A B A B
A B B B A B A A
A B A A A A A A
B B A A A A A B
A B A B A B B B
A B A B A B A A
B B A B A B A A
A B A A A B A B
A B B B B B A B
A A A B A A A B
B B A B A B A B
A B B B A A A B
B B B B A A A A
A B A A A B A A
A B A A A B A B
B B A A A A A B
A A A B A A A B
A B A A A A B B
A A A A A A A A
A B B B A A A A
;
proc haplotype data=markers out=hapout
init=random prefix=SNP;
var m1-m8;
run;
The HAPLOTYPE Procedure
Analysis Information
Loci Used SNP1 SNP2 SNP3 SNP4
Number of Individuals 25
Number of Starts 1
Convergence Criterion 0.00001
Iterations Checked for Conv. 1
Maximum Number of Iterations 100
Number of Iterations Used 24
Log Likelihood -97.62955
Initialization Method Random
Random Number Seed 520781000
Standard Error Method Binomial
Haplotype Frequency Cutoff 0
Algorithm converged.
Haplotype Frequencies
Standard 95% Confidence
Number Haplotype Freq Error Limits
1 A-A-A-A 0.16312 0.05278 0.05967 0.26657
2 A-A-A-B 0.02642 0.02291 0.00000 0.07132
3 A-A-B-A 0.00000 0.00000 0.00000 0.00001
4 A-A-B-B 0.02655 0.02297 0.00000 0.07157
5 A-B-A-A 0.02942 0.02414 0.00000 0.07673
6 A-B-A-B 0.12429 0.04713 0.03192 0.21667
7 A-B-B-A 0.06964 0.03636 0.00000 0.14091
8 A-B-B-B 0.00056 0.00339 0.00000 0.00720
9 B-A-A-A 0.09444 0.04178 0.01256 0.17632
10 B-A-A-B 0.07297 0.03715 0.00015 0.14579
11 B-A-B-A 0.07333 0.03724 0.00034 0.14632
12 B-A-B-B 0.12317 0.04695 0.03116 0.21519
13 B-B-A-A 0.12935 0.04794 0.03539 0.22331
14 B-B-A-B 0.00000 0.00000 0.00000 0.00000
15 B-B-B-A 0.04071 0.02823 0.00000 0.09603
16 B-B-B-B 0.02603 0.02275 0.00000 0.07061
proc print data=hapout noobs round; run;
_ID_ m1 m2 m3 m4 m5 m6 m7 m8 HAPLOTYPE1 HAPLOTYPE2 PROB
1 B B A B B B A A B-A-B-A B-B-B-A 1.00
2 A A B B A B A B A-B-A-A A-B-B-B 0.00
2 A A B B A B A B A-B-A-B A-B-B-A 1.00
3 B B A A B B B B B-A-B-B B-A-B-B 1.00
4 A B A B A B A B A-A-A-A B-B-B-B 0.16
4 A B A B A B A B A-A-A-B B-B-B-A 0.04
4 A B A B A B A B A-A-B-B B-B-A-A 0.13
4 A B A B A B A B A-B-A-A B-A-B-B 0.14
4 A B A B A B A B A-B-A-B B-A-B-A 0.34
4 A B A B A B A B A-B-B-A B-A-A-B 0.19
4 A B A B A B A B A-B-B-B B-A-A-A 0.00
5 A A A B A B B B A-A-A-B A-B-B-B 0.00
5 A A A B A B B B A-A-B-B A-B-A-B 1.00
6 B B A A A B A B B-A-A-A B-A-B-B 0.68
6 B B A A A B A B B-A-A-B B-A-B-A 0.32
7 A B B B A B A A A-B-A-A B-B-B-A 0.12
7 A B B B A B A A A-B-B-A B-B-A-A 0.88
8 A B A A A A A A A-A-A-A B-A-A-A 1.00
9 B B A A A A A B B-A-A-A B-A-A-B 1.00
10 A B A B A B B B A-A-A-B B-B-B-B 0.04
10 A B A B A B B B A-B-A-B B-A-B-B 0.95
10 A B A B A B B B A-B-B-B B-A-A-B 0.00
11 A B A B A B A A A-A-A-A B-B-B-A 0.43
11 A B A B A B A A A-B-A-A B-A-B-A 0.14
11 A B A B A B A A A-B-B-A B-A-A-A 0.43
12 B B A B A B A A B-A-A-A B-B-B-A 0.29
12 B B A B A B A A B-A-B-A B-B-A-A 0.71
13 A B A A A B A B A-A-A-A B-A-B-B 0.82
13 A B A A A B A B A-A-A-B B-A-B-A 0.08
13 A B A A A B A B A-A-B-B B-A-A-A 0.10
14 A B B B B B A B A-B-B-A B-B-B-B 0.99
14 A B B B B B A B A-B-B-B B-B-B-A 0.01
15 A A A B A A A B A-A-A-A A-B-A-B 0.96
15 A A A B A A A B A-A-A-B A-B-A-A 0.04
16 B B A B A B A B B-A-A-A B-B-B-B 0.12
16 B B A B A B A B B-A-A-B B-B-B-A 0.14
16 B B A B A B A B B-A-B-B B-B-A-A 0.75
17 A B B B A A A B A-B-A-B B-B-A-A 1.00
18 B B B B A A A A B-B-A-A B-B-A-A 1.00
19 A B A A A B A A A-A-A-A B-A-B-A 1.00
20 A B A A A B A B A-A-A-A B-A-B-B 0.82
20 A B A A A B A B A-A-A-B B-A-B-A 0.08
20 A B A A A B A B A-A-B-B B-A-A-A 0.10
21 B B A A A A A B B-A-A-A B-A-A-B 1.00
22 A A A B A A A B A-A-A-A A-B-A-B 0.96
22 A A A B A A A B A-A-A-B A-B-A-A 0.04
23 A B A A A A B B A-A-A-B B-A-A-B 1.00
Haplotype Trend Regression (Zaykin)
data alleles;
   input (a1-a6) ($) disease;
   datalines;
A a B B c C 1
A A B b c C 1
a A B b c c 0
A A B B c C 1
A A b B c C 1
A A B b C c 0
A a b B C c 1
A A b B C c 1
A a B B c c 1
a a B b c c 0
A A B B C C 1
A A B B c c 1
a A b b c c 0
A A B B c c 1
A A b b c c 0
A A b B c C 0
A A B b c C 1
A a b B c c 1
A a B B c C 1
A A b b C C 0
A A B B C C 1
A A b B C c 1
A A b B c C 1
a A B b C c 0
A a B B C C 0
A A B B C c 1
A A B b C c 0
A A B B c C 1
a A B b C C 1
A a B b C c 1
A A B b c C 1
A a B B c c 1
A A B b C c 1
a A B b C c 1
A A B b C C 1
A a B B C C 1
a A B b C c 0
a A b B C C 0
A A B b c C 1
a A B b c c 0
A A B B C C 0
A A B B c c 1
A a B B C c 1
;
proc haplotype data=alleles out=out outid;
   var a1-a6;
   trait disease;
   id disease;
run;
The HAPLOTYPE Procedure
Analysis Information
Loci Used M1 M2 M3
Number of Individuals 43
Number of Starts 1
Convergence Criterion 0.00001
Iterations Checked for Conv. 1
Maximum Number of Iterations 100
Number of Iterations Used 31
Log Likelihood -115.48338
Initialization Method Linkage Equilibrium
Standard Error Method Binomial
Haplotype Frequency Cutoff 0
Algorithm converged.
Haplotype Frequencies
Standard 95% Confidence
Number Haplotype Freq Error Limits
1 A-B-C 0.26144 0.04766 0.16802 0.35485
2 A-B-c 0.22524 0.04531 0.13643 0.31405
3 A-b-C 0.13806 0.03742 0.06473 0.21140
4 A-b-c 0.14270 0.03794 0.06834 0.21706
5 a-B-C 0.07662 0.02885 0.02007 0.13316
6 a-B-c 0.08787 0.03071 0.02768 0.14805
7 a-b-C 0.00063 0.00271 0.00000 0.00595
8 a-b-c 0.06745 0.02720 0.01413 0.12076
Test for Marker-Trait Association
Trait Trait Num Chi-
Number Value Obs DF LogLike Square
1 1 29 7 -68.11558
2 0 14 7 -37.28544
Combined 43 7 -115.48338 20.1647
Test for
Marker-Trait
Association
Pr >
ChiSq 0.0052
proc print data=out;
run;
OBS _ID_ disease a1 a2 a3 a4 a5 a6 HAPLOTYPE1 HAPLOTYPE2 PROB
1 1 1 A a B B c C A-B-C a-B-c 0.57103
2 1 1 A a B B c C A-B-c a-B-C 0.42897
3 2 1 A A B b c C A-B-C A-b-c 0.54538
4 2 1 A A B b c C A-B-c A-b-C 0.45462
5 3 0 a A B b c c A-B-c a-b-c 0.54783
6 3 0 a A B b c c A-b-c a-B-c 0.45217
7 4 1 A A B B c C A-B-C A-B-c 1.00000
8 5 1 A A b B c C A-B-C A-b-c 0.54538
9 5 1 A A b B c C A-B-c A-b-C 0.45462
10 6 0 A A B b C c A-B-C A-b-c 0.54538
11 6 0 A A B b C c A-B-c A-b-C 0.45462
12 7 1 A a b B C c A-B-C a-b-c 0.43177
13 7 1 A a b B C c A-B-c a-b-C 0.00346
14 7 1 A a b B C c A-b-C a-B-c 0.29706
/* One row per haplotype copy: stack HAPLOTYPE1 and HAPLOTYPE2 */
data out1;
   set out;
   haplotype=tranwrd(haplotype1,'-','_');
data out2;
   set out;
   haplotype=tranwrd(haplotype2,'-','_');
data outnew;
   set out1 out2;
proc sort data=outnew;
   by haplotype;
run;
/* Number the distinct haplotypes H1, H2, ... */
data outnew2;
   set outnew;
   lagh=lag(haplotype);
   if haplotype ne lagh then num+1;
   hapname=compress("H"||num,' ');
proc sort data=outnew2;
   by _id_ hapname;
run;
/* Sum prob/2 within each individual and haplotype */
data outt;
   set outnew2;
   by _id_ haplotype;
   if first.haplotype then totprob=prob/2;
   else totprob+prob/2;
   if last.haplotype;
proc transpose data=outt out=outreg(drop=_NAME_);
   id hapname;
   idlabel haplotype;
   var totprob;
   by _id_ disease;
run;
/* Haplotypes an individual cannot carry get probability 0 */
data htr;
   set outreg;
   array h{8};
   do i=1 to 8;
      if h{i}=. then h{i}=0;
   end;
   keep _id_ disease h1-h8;
proc print data=htr; *noobs round label;
run;
proc logistic data=htr descending;
   model disease = h1-h8 / selection=stepwise;
run;
Individual
ID disease A_B_C A_B_c a_B_C
1 1 0.29 0.21 0.21
2 1 0.27 0.23 0.00
3 0 0.00 0.27 0.00
4 1 0.50 0.50 0.00
5 1 0.27 0.23 0.00
6 0 0.27 0.23 0.00
7 1 0.22 0.00 0.13
8 1 0.27 0.23 0.00
9 1 0.00 0.50 0.00
10 0 0.00 0.00 0.00
11 1 1.00 0.00 0.00
12 1 0.00 1.00 0.00
13 0 0.00 0.00 0.00
14 1 0.00 1.00 0.00
15 0 0.00 0.00 0.00
16 0 0.27 0.23 0.00
17 1 0.27 0.23 0.00
18 1 0.00 0.27 0.00
19 1 0.29 0.21 0.21
20 0 0.00 0.00 0.00
21 1 1.00 0.00 0.00
22 1 0.27 0.23 0.00
23 1 0.27 0.23 0.00
24 0 0.22 0.00 0.13
25 0 0.50 0.00 0.50
26 1 0.50 0.50 0.00
27 0 0.27 0.23 0.00
28 1 0.50 0.50 0.00
29 1 0.01 0.00 0.49
a_B_c A_b_C A_b_c a_b_c a_b_C
0.29 0.00 0.00 0.00 0.00
0.00 0.23 0.27 0.00 0.00
0.23 0.00 0.23 0.27 0.00
0.00 0.00 0.00 0.00 0.00
0.00 0.23 0.27 0.00 0.00
0.00 0.23 0.27 0.00 0.00
0.15 0.15 0.13 0.22 0.00
0.00 0.23 0.27 0.00 0.00
0.50 0.00 0.00 0.00 0.00
0.50 0.00 0.00 0.50 0.00
0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.50 0.50 0.00
0.00 0.00 0.00 0.00 0.00
0.00 0.00 1.00 0.00 0.00
0.00 0.23 0.27 0.00 0.00
0.00 0.23 0.27 0.00 0.00
0.23 0.00 0.23 0.27 0.00
0.29 0.00 0.00 0.00 0.00
0.00 1.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00
0.00 0.23 0.27 0.00 0.00
0.00 0.23 0.27 0.00 0.00
0.15 0.15 0.13 0.22 0.00
0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00
0.00 0.23 0.27 0.00 0.00
0.00 0.00 0.00 0.00 0.00
0.00 0.49 0.00 0.00 0.01
Column-to-variable mapping: A_B_C = H1, A_B_c = H2, a_B_C = H5, a_B_c = H6, A_b_C = H3, A_b_c = H4, a_b_c = H8, a_b_C = H7
Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 6.1962 1 0.0128
Score 6.3995 1 0.0114
Wald 4.9675 1 0.0258
Analysis of Maximum Likelihood Estimates
Standard Wald
Parameter DF Estimate Error Chi-Square Pr > ChiSq
Intercept 1 1.1986 0.4058 8.7224 0.0031
H8 1 -6.3249 2.8378 4.9675 0.0258