TAMS 23: Statistical Methods in Bioinformatics
Lecture 6: Modelling of DNA and Words in DNA
Timo Koski
Mathematical Statistics / Linköpings tekniska högskola
07/04/2003 – p.1/69
![Page 2: TAMS 23: Statistiska metoder i bioinformatik Föreläsning 6 ...web.abo.fi/fak/mnf/mate/jc/miscFiles/kurs23lect6.pdf · Prob(Data Model) A sequence of strings is training data, i.e.,](https://reader033.vdocuments.pub/reader033/viewer/2022053018/5f1d40082a358455bd2bfad5/html5/thumbnails/2.jpg)
Lecture 6
The lecture is based on portions of Chapter 5 and on Section 10.4 of Ewens and Grant. As far as the occurrences (or appearances) of DNA words are concerned, however, the lecture is a mix of original research contributions and the material in the textbook. Section 5.1 'shotgun sequencing', Section 5.4 'long repeats', and Section 5.5 'r-scans' are offered as possible examination items ('presentationer', i.e. student presentations).
Lecture 6: Contents
1) Weight Matrix Model
2) Markov Modelling
3) Words in a DNA sequence
   (A) The number of occurrences
   (B) The length between one occurrence of a word and the next
   (C) The waiting time till appearance of a word, including the Markovian case
   (D) p.g.f. for the waiting time till appearance of a word
An Image of Practical Life Science
Weight Matrix Model
A weight matrix M is a simple model often used by molecular biologists as a representation of a family of signals. The sequences containing the signals are supposed to have equal length (= n) and to have no gaps (no positions are blank).
Weight Matrix Model
A weight matrix M has as entries the probabilities p_j(i) (e.g. observed relative frequencies) that a string should have base i ∈ {A, C, G, T} at position j = 1, …, n:

M = ( p_1(A)  p_2(A)  …  p_n(A)
      p_1(C)  p_2(C)  …  p_n(C)
      p_1(G)  p_2(G)  …  p_n(G)
      p_1(T)  p_2(T)  …  p_n(T) )
The weight matrix model is often called a profile.
Prob(Data | Model)

The probability of a finite sequence x = x_1 x_2 … x_n given the model M is

P(x | M) = \prod_{j=1}^{n} \prod_{i \in \{A,C,G,T\}} p_j(i)^{I(x_j = i)},

where the indicator I(x_j = i), a function of x, is 0 if x_j ≠ i, i.e., if the symbol i does not appear in position j in the string x, and is 1 otherwise. Thus the bases in the different positions are independent given M.
Prob(Data | Model)

A sequence of strings x^(1), …, x^(N) is training data, i.e., known cases of members of a signal family. We take the strings to be generated independently given M; multiplication of the preceding expressions then assigns them the probability

P(x^(1), …, x^(N) | M) = \prod_{k=1}^{N} P(x^{(k)} | M) = \prod_{j=1}^{n} \prod_{i \in \{A,C,G,T\}} p_j(i)^{n_j(i)},

where n_j(i) is the number of times the symbol i appears at position j across x^(1), …, x^(N).
Prob(Data | Model): Maximum Likelihood

P(x^(1), …, x^(N) | M) = \prod_{j=1}^{n} \prod_{i \in \{A,C,G,T\}} p_j(i)^{n_j(i)},

where n_j(i) is the number of times the symbol i appears at position j in x^(1), …, x^(N). The maximum likelihood estimate is

\hat{p}_j(i) = \frac{n_j(i)}{N}.

This will be shown during a later 'lektion' (lesson).
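As a sketch, the product probability P(x | M) and the maximum likelihood estimate p̂_j(i) = n_j(i)/N can be written out in a few lines of Python. The training strings and function names below are invented for illustration only.

```python
BASES = "ACGT"

def train_weight_matrix(strings):
    """MLE: p_hat_j(i) = n_j(i) / N over N aligned, gap-free strings."""
    N = len(strings)
    n = len(strings[0])
    return [{b: sum(s[j] == b for s in strings) / N for b in BASES}
            for j in range(n)]

def prob_given_model(x, M):
    """P(x | M) = prod_j p_j(x_j): positions are independent given M."""
    p = 1.0
    for j, base in enumerate(x):
        p *= M[j][base]
    return p

# Four invented training strings of equal length (a toy signal family).
training = ["TATAAT", "TATGAT", "TACAAT", "TATAAT"]
M = train_weight_matrix(training)
print(prob_given_model("TATAAT", M))   # 1.0 * 1.0 * 0.75 * 0.75 * 1.0 * 1.0 = 0.5625
```

Note that a position where a base never occurs in the training data gets estimated probability zero, so any string with that base there gets probability zero; in practice pseudocounts are often added, but that is outside the slide's formula.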
Example: Promoter Regions
RNA polymerase molecules start transcription by recognizing and binding to promoter regions upstream of the desired transcription start sites. Unfortunately, promoter regions do not follow a strict pattern, but it is possible to find a DNA sequence (called the consensus sequence) to which all of them are very similar.
Example: Promoter Regions
For example, the consensus sequence in the bacterium E. coli, based on the study of 263 promoters, is TTGACA followed by 17 random base pairs followed by TATAAT, with the latter located about 10 bases upstream of the transcription start site. None of the 263 promoter sites exactly matches the above consensus sequence.
Weight Matrix Model

By constructing the weight matrix of the TATAAT region (n = 6) we can take a DNA sequence x = x_1 x_2 … x_6 and compute the probability

P(x | promoter).

This means that successive bases are thought of as being generated independently from the distributions in the weight matrix table. Similarly, using, e.g., a weight matrix model of the signal family from a non-promoter region, we can compute

P(x | non-promoter)

and compare

P(x | promoter)  vs.  P(x | non-promoter)

to decide whether x is a member of the family.
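A minimal sketch of this likelihood comparison, assuming two weight matrices are available. Both profiles below are made-up numbers, not the TATAAT matrix from the 263-promoter study.

```python
BASES = "ACGT"

def prob_given_model(x, M):
    """P(x | M) for a weight matrix M (list of per-position distributions)."""
    p = 1.0
    for j, base in enumerate(x):
        p *= M[j][base]
    return p

# Invented length-3 profiles: a 'signal' profile and a uniform background.
signal = [{"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
          {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
          {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7}]
background = [{b: 0.25 for b in BASES} for _ in range(3)]

x = "TAT"
# Decide membership by comparing the two likelihoods.
in_family = prob_given_model(x, signal) > prob_given_model(x, background)
print(in_family)   # True: 0.7**3 = 0.343 beats 0.25**3 = 0.015625
```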
Markov Model

A Markov model is a statistical 'counter hypothesis' to weight matrix models. Its transition matrix is

        A      C      G      T
  A   p_AA   p_AC   p_AG   p_AT
  C   p_CA   p_CC   p_CG   p_CT
  G   p_GA   p_GC   p_GG   p_GT
  T   p_TA   p_TC   p_TG   p_TT

where p_ab is the probability that base a is followed by base b. The matrix P contains 12 unknown free probabilities (each row sums to one) that will have to be learned from training data, i.e., from known cases of members of a signal family.
Prob(Data | Model)

If x = x_1 x_2 … x_n, where each x_t ∈ {A, C, G, T}, then

P(x) = P(x_1 x_2 … x_n) = \pi(x_1) \prod_{t=2}^{n} p_{x_{t-1} x_t},

where \pi is the initial distribution.
An Image of Practical Life Science
Learning with Markov chains

[4 × 4 matrix of transition probabilities]

We change the interpretation: the function P(x | P) is regarded as a function of P (or of the probabilities in P), called a likelihood function and denoted by

L(P) = P(x | P) = \pi(x_1) \prod_{t=2}^{n} p_{x_{t-1} x_t}.

The maximum likelihood estimate of P is obtained by maximizing L(P) as a function of P.
Learning with Markov chains

The maximum likelihood estimate \hat{p}_{ab} of p_{ab} is

\hat{p}_{ab} = \frac{n_{ab}}{n_a}   for all a and b.

Here n_{ab} is the number of times the sequence contains the pair of bases (a, b) (in this order), i.e., the number of transitions from a to b, and n_a is the number of times the base a occurs in the sequence.

This will be shown during a later 'lektion' (lesson).
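The estimate p̂_ab = n_ab / n_a can be computed directly from a sequence, as in this sketch; the toy sequence is invented. Here n_a counts occurrences of a that have a successor (so the last base is excluded), which keeps each estimated row summing to one.

```python
from collections import Counter

def markov_mle(seq):
    """p_hat_{ab} = n_{ab} / n_a: transition counts over pair-start counts."""
    n_ab = Counter(zip(seq, seq[1:]))   # counts of consecutive pairs (a, b)
    n_a = Counter(seq[:-1])             # counts of a as the start of a pair
    return {ab: count / n_a[ab[0]] for ab, count in n_ab.items()}

seq = "ACGTACGTAACG"                    # invented toy training sequence
P_hat = markov_mle(seq)
print(P_hat[("A", "C")])                # 3 of the 4 pair-starting A's go to C -> 0.75
```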
Modelling with Markov chains

\hat{p}_{ab} = \frac{n_{ab}}{n_a}   for all a and b,

where n_{ab} is the number of times the sequence contains the pair of bases (a, b) (in this order), i.e., the number of transitions from a to b, and n_a is the number of times the base a occurs in the sequence. Hence Markov models assume that there is biological information contained in the frequency of pairs of bases following each other.
An Image of Practical Life Science
Learning & Parametric Inference
In maximum likelihood estimation we have regarded the transition probabilities as parameters and used training data to infer their values. Inference is the process of deriving a conclusion from facts and/or premises. In probabilistic modelling of sequences the facts are the observed sequences, the premises are represented by the model, and the conclusions concern unobserved quantities.
Another Image of Practical Life Science
Words in a DNA sequence
Word means here a subsequence of a larger sequence. This is a unique subsequence, not a family of signals.
Words in a DNA sequence
Frequency of occurrence of specific short nucleotide words (like GAGA) within part or all of a nucleotide sequence is of interest for several biological reasons. The occurrence of a word in unexpectedly large numbers in a segment may provide information about structure or function.
Words in a DNA sequence
Information about structure or function:
- a segment of DNA may have originally been formed by replication of smaller segments. Such replication might be exact initially but might be altered over time. But in regions where retention of function is necessary, exact repeats would be expected to occur.
- the occurrence of repeated subsequences may identify locations where an insert has been introduced into the DNA sequence.
DNA Words: statistical assumptions

Let X assume values in {A, C, G, T} and let

P(X = A) = P(X = C) = P(X = G) = P(X = T) = 1/4.

X is a nucleotide chosen at random (the DNA dice). Hence the probability of the occurrence of a given word of length L is

(1/4)^L.
DNA Words: statistical assumptions
The DNA dice is biologically unreasonable as a model for a whole genome, but can serve well as a statistical null hypothesis to detect important deviations from random noise.
(A) Occurrence of a word

We say that 'a word occurs at position i' (like GAGA at position 7 in the figure) if it is found to begin at position i in a longer sequence.

A T T C C A G A G A A T G G C G A C A
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19

(the word GAGA begins at position 7)
(A) Occurrence of a word

Let the random variable Y(N) be the number of occurrences of a nucleotide word (like GAGA) of length L (like 4) within a nucleotide sequence of length N, N ≥ L. Then N − L + 1 is the maximum value achievable by Y(N).
Occurrence of a word: overlap capability (examples)

The word ACAC cannot occur more than 9 times in a sequence of length 20, because of its 'overlap capability':

ACACACACACACACACACAC
Occurrence of a word: overlap capability (examples)

The word AAAA can occur between 0 and 17 times in a sequence of length 20, because of its greater 'overlap capability'. The word ACGT has no overlap capability except in the trivial case.
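Counting occurrences that are allowed to overlap is a one-liner; this sketch reproduces the two extreme examples above.

```python
def count_occurrences(word, seq):
    """Number of positions at which the word begins (overlaps allowed)."""
    L = len(word)
    return sum(seq[i:i + L] == word for i in range(len(seq) - L + 1))

print(count_occurrences("ACAC", "AC" * 10))   # 9, the maximum for length 20
print(count_occurrences("AAAA", "A" * 20))    # 17, the maximum for length 20
```

Note that built-in substring counters such as `str.count` skip overlapping matches and would return 5 for ACAC here, which is why the explicit position scan is used.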
Occurrence of a word

Let the random variable Y(N) be the number of occurrences of a nucleotide word of length L within a nucleotide sequence of length N. N is thought to be so large that end effects can be neglected. Let its probability function be denoted by

P(Y(N) = y),   y = 0, 1, …, N − L + 1.

This does depend on the overlap capability ε (to be defined) and is therefore not a binomial probability function.
Overlap capability

We define the overlap capability of a word as follows. Let s be a word, s = s_1 s_2 … s_L, so that the s_k are its nucleotides read from left to right. The overlap capability ε is a binary sequence ε = (ε_1, ε_2, …, ε_L), where ε_j is defined as follows: ε_j = 1 if it is possible for the word's first j letters to overlap its last j letters (in the same order), and ε_j = 0 otherwise. Overlap capability is also known as the autocorrelation of a word.
Overlap capability

The overlap capability ε of s = s_1 s_2 … s_L is a binary sequence ε = (ε_1, …, ε_L), where ε_j = 1 if the word's first j letters equal its last j letters (in the same order), and ε_j = 0 otherwise. More formally,

ε_j = 1 if s_k = s_{L-j+k} for k = 1, …, j, and ε_j = 0 otherwise.
Overlap capability: examples (Ewens & Grant p. 169)

ε_j = 1 if s_k = s_{L-j+k} for k = 1, …, j, and ε_j = 0 otherwise.

        ε_1  ε_2  ε_3  ε_4
GAGA     0    1    0    1
GGGG     1    1    1    1
GAAG     1    0    0    1
GAGC     0    0    0    1
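The definition translates directly into code, with the trivial ε_L = 1 included; the function name is ad hoc. This sketch reproduces the table above.

```python
def overlap_capability(word):
    """eps_j = 1 if the word's first j letters equal its last j letters."""
    L = len(word)
    return tuple(int(word[:j] == word[L - j:]) for j in range(1, L + 1))

# The four examples from Ewens & Grant p. 169:
print(overlap_capability("GAGA"))   # (0, 1, 0, 1)
print(overlap_capability("GGGG"))   # (1, 1, 1, 1)
print(overlap_capability("GAAG"))   # (1, 0, 0, 1)
print(overlap_capability("GAGC"))   # (0, 0, 0, 1)
```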
Number of Occurrences of a Word

We are not going to derive the probability function P(Y(N) = y) for Y(N), the number of occurrences of a nucleotide word of length L within a nucleotide sequence of length N. Instead we are going to find

E[Y(N)]   and   Var(Y(N)).
E[Y(N)], Expected Number of Occurrences of a Word

Let P = (1/4)^L and

Y_i = 1 if the word begins in position i, and Y_i = 0 otherwise.

Then

Y(N) = Y_1 + Y_2 + … + Y_{N-L+1}

and

E[Y(N)] = E[Y_1] + … + E[Y_{N-L+1}] = \sum_{i=1}^{N-L+1} P(\text{the word begins in position } i) = (N - L + 1) P.
E[Y(N)], Expected Number of Occurrences of a Word

The expected number of occurrences of a word,

E[Y(N)] = (N - L + 1) (1/4)^L,

does not depend on the overlap capability!
Var(Y(N)), variance of the Number of Occurrences of a Word

\mathrm{Var}(Y(N)) = E[Y(N)^2] - (E[Y(N)])^2,

where

E[Y(N)^2] = E[(Y_1 + … + Y_{N-L+1})^2] = \sum_{i} E[Y_i] + \sum_{i \ne i'} E[Y_i Y_{i'}],

since Y_i^2 = Y_i and E[Y_i] = P = (1/4)^L.
E[Y_i Y_{i+j}]

The correlations between the indicators Y_i and Y_{i'} that are near neighbours depend on ε. E[Y_i Y_{i+j}] is just the probability that the word occurs at both position i and position i + j.
E[Y_i Y_{i+j}]

- If j ∈ {1, 2, …, L−1} and ε_{L-j} = 1, then

  E[Y_i Y_{i+j}] = P (1/4)^j.

  (If j ∈ {1, 2, …, L−1} and ε_{L-j} = 0, the word cannot occur at both positions and E[Y_i Y_{i+j}] = 0.)

- If j ≥ L, then ε is irrelevant, and

  E[Y_i Y_{i+j}] = P^2.
Counting the cases in E[Y(N)^2]

- There are N − L + 1 terms among the E[Y_i Y_{i'}] in E[(Y_1 + … + Y_{N-L+1})^2] such that i = i'.
- For each j = 1, 2, …, L−1 there are 2(N − L + 1 − j) terms such that i' − i = j or i − i' = j.
- There are

  (N - L + 1)(N - L) - 2 \sum_{j=1}^{L-1} (N - L + 1 - j)

  remaining terms (those with |i − i'| ≥ L), and these do not depend on ε.
E[Y(N)^2]

E[Y(N)^2] = (N - L + 1) P
  + 2 \sum_{j=1}^{L-1} (N - L + 1 - j) \varepsilon_{L-j} P (1/4)^j
  + \Big[ (N - L + 1)(N - L) - 2 \sum_{j=1}^{L-1} (N - L + 1 - j) \Big] P^2.
Var(Y(N)), variance of the Number of Occurrences of a Word

\mathrm{Var}(Y(N)) = (N - L + 1) P (1 - P)
  + 2 P \sum_{j=1}^{L-1} (N - L + 1 - j) \varepsilon_{L-j} (1/4)^j
  - 2 P^2 \sum_{j=1}^{L-1} (N - L + 1 - j).

On the right-hand side the second term equals 0 if ε_{L-j} = 0 for all j = 1, …, L−1, i.e., if the word has no non-trivial overlap capability. If L = 1, both sums are empty, the second and third terms equal zero, and the variance N P (1 − P) is that of a binomial R.V.
Var(Y(N)), generalization

Now let the alphabet be generic (K symbols) with probabilities p_1, …, p_K, and let s be a word of L letters with letter probabilities p_{(1)}, …, p_{(L)}. Let

P_j = p_{(1)} p_{(2)} \cdots p_{(j)}

be the product of the probabilities of the first j letters of s. Then we insert in the preceding formulas P_L for P = (1/4)^L and P_j for (1/4)^j.
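The mean and variance formulas above can be checked against exhaustive enumeration over all 4^N sequences for small N. This sketch assumes the uniform DNA-dice model; the function names are ad hoc.

```python
from itertools import product

def overlap_capability(word):
    L = len(word)
    return [int(word[:j] == word[L - j:]) for j in range(1, L + 1)]

def moments(word, N):
    """E[Y(N)] and Var(Y(N)) from the slide formulas, uniform base model."""
    L = len(word)
    eps = overlap_capability(word)
    P = 0.25 ** L
    M = N - L + 1                                 # number of possible positions
    mean = M * P
    ey2 = M * P                                   # diagonal terms i = i'
    for j in range(1, L):                         # overlapping pairs, |i - i'| = j
        ey2 += 2 * (M - j) * eps[L - j - 1] * P * 0.25 ** j
    far_pairs = M * (M - 1) - 2 * sum(M - j for j in range(1, L))
    ey2 += far_pairs * P * P                      # independent pairs, |i - i'| >= L
    return mean, ey2 - mean ** 2

def brute_moments(word, N):
    """Exact moments by enumerating all 4**N equally likely sequences."""
    L = len(word)
    counts = [sum(s[i:i + L] == tuple(word) for i in range(N - L + 1))
              for s in product("ACGT", repeat=N)]
    m = sum(counts) / len(counts)
    return m, sum(c * c for c in counts) / len(counts) - m * m

print(moments("AA", 4))        # agrees with brute_moments("AA", 4)
```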
Reference

The results about E[Y(N)] and Var(Y(N)) presented above are due to:

J. F. Gentleman and R. C. Mullin: The Distribution of the Frequency of Occurrence of Nucleotide Subsequences, Based on their Overlap Capability. Biometrics, 45, pp. 35–52, 1989.

Gentleman and Mullin also find the full probability distribution P(Y(N) = y).
(B) The length between one occurrence of a word and the next
(B) The length between one occurrence of a word and the next

Let s be a word, s = s_1 s_2 … s_L, with overlap capability ε = (ε_1, …, ε_L). Let D be the distance to the next occurrence of s after s has occurred at position i. Let

f(y) = P(D = y),   y = 1, 2, ….

We are going to find a recursive formula for f(y).
Some pertinent events

Let D be the distance to the next occurrence of s after s has occurred at i, and

f(y) = P(D = y).

Let

A_y = { s occurs at i + y but nowhere between i and i + y }.

Then f(y) = P(A_y).
More pertinent events

Let

A_y = { s occurs at i + y but nowhere between i and i + y },

and let

B_y = { s occurs at i + y },

C_j = { s occurs at i + j and at i + y, but nowhere between i and i + j },   j = 1, …, y − 1.

Then

B_y = A_y ∪ C_1 ∪ C_2 ∪ … ∪ C_{y-1}.
Decomposition of events

The decomposition

B_y = A_y ∪ C_1 ∪ … ∪ C_{y-1}

yields, since the events on the right-hand side are disjoint,

P(B_y) = f(y) + \sum_{j=1}^{y-1} P(C_j).

We shall now compute the probabilities in the right-hand side of this expression.
Decomposition of events: the case 1 ≤ y ≤ L − 1

Let 1 ≤ y ≤ L − 1. Recall

B_y = A_y ∪ C_1 ∪ … ∪ C_{y-1},

P(B_y) = f(y) + \sum_{j=1}^{y-1} P(C_j).
Decomposition of events: the case 1 ≤ y ≤ L − 1, P(B_y)

[Figure: the word s = s_1 s_2 s_3 s_4 occurring at positions i and i + y, with 1 ≤ j ≤ y − 1 ≤ L − 1.]

P(B_y) = \varepsilon_{L-y} (1/4)^y,

since an occurrence at i + y must overlap the occurrence at i.
Decomposition of events: the case 1 ≤ y ≤ L − 1, P(C_j)

[Figure: the word s = s_1 s_2 s_3 s_4 occurring at positions i, i + j and i + y, with 1 ≤ j ≤ y − 1 ≤ L − 1.]

For 1 ≤ j ≤ y − 1 ≤ L − 1 we have

P(C_j) = f(j) \varepsilon_{L-y+j} (1/4)^{y-j},

since the overlaps are to be taken into account.
f(y) for 1 ≤ y ≤ L − 1

From

P(B_y) = f(y) + \sum_{j=1}^{y-1} P(C_j),

given 1 ≤ y ≤ L − 1, we have

f(y) = \varepsilon_{L-y} (1/4)^y - \sum_{j=1}^{y-1} f(j) \varepsilon_{L-y+j} (1/4)^{y-j}.
Decomposition of events: the cases
� �
,� �
Let
� � .
� � � � � � �"
�
07/04/2003 – p.55/69
Probability $u_{y-j}$; the case $L \le j \le y-L$

For $L \le j \le y-L$ we have that

$$u_{y-j} = p,$$

since the events that $w$ occurs at $i+j$ but nowhere between $i$ and $i+j$, and that $w$ occurs at $i+y$, are in this case independent.

[Figure: occurrences of $w = s_1 s_2 s_3 s_4$ at $i$, $i+j$ and $i+y$ with $L \le j \le y-L$; the occurrences do not overlap.]
Probability $u_{y-j}$; the case $y-L+1 \le j \le y-1$

For $y-L+1 \le j \le y-1$ we have that

$$u_{y-j} = \varepsilon_{y-j} \prod_{l=L-(y-j)+1}^{L} p_{s_l},$$

since for $y-L+1 \le j \le y-1$ overlaps are to be taken into account.

[Figure: occurrences of $w = s_1 s_2 s_3 s_4$ at $i$, $i+j$ and $i+y$ with $y-L+1 \le j \le y-1$; the last two occurrences overlap.]
$P(Y = y)$ for $y \ge L$

Hence for $y \ge L$

$$f_y = p \left( 1 - \sum_{j=1}^{y-L} f_j \right) - \sum_{j=y-L+1}^{y-1} f_j\, \varepsilon_{y-j} \prod_{l=L-(y-j)+1}^{L} p_{s_l},$$

and the recursion of Ewens and Grant is complete.
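The recursion is straightforward to evaluate numerically. Below is a minimal Python sketch, not from the lecture: the function names and the i.i.d. letter model with probabilities `probs` are this note's own illustration of $f_y = u_y - \sum_{j=1}^{y-1} f_j\, u_{y-j}$.

```python
import math

def u(word, probs, k):
    """u_k = P(word occurs at i+k | word occurs at i), i.i.d. letters."""
    L = len(word)
    if k == 0:
        return 1.0
    if k < L:
        if word[k:] != word[:L - k]:          # overlap indicator eps_k = 0
            return 0.0
        return math.prod(probs[s] for s in word[L - k:])
    return math.prod(probs[s] for s in word)  # disjoint letters: u_k = p

def distance_pmf(word, probs, y_max):
    """f_y = P(Y = y) via f_y = u_y - sum_{j=1}^{y-1} f_j u_{y-j}."""
    f = [0.0] * (y_max + 1)
    for y in range(1, y_max + 1):
        f[y] = u(word, probs, y) - sum(
            f[j] * u(word, probs, y - j) for j in range(1, y))
    return f
```

For the word AA with $p_A = 1/2$ this gives $f_1 = 1/2$, $f_2 = 0$, $f_3 = 1/8$, and the mean distance between occurrences is $1/p = 4$, as renewal theory predicts.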
(C) The waiting time to the first occurrence of a word: beginning at the origin

$$P(T_w = n) = p \left( 1 - \sum_{j=L}^{n-L} P(T_w = j) \right) - \sum_{j=n-L+1}^{n-1} P(T_w = j)\, \varepsilon_{n-j} \prod_{l=L-(n-j)+1}^{L} p_{s_l}, \qquad n \ge L.$$

This formula is due to Gunnar Blom and Daniel Thorburn (1982), as the probability function of the waiting time $T_w$ until the word has appeared, in the sense that all its symbols have been seen; hence

$$P(T_w = n) = 0 \quad \text{for } n < L.$$
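The Blom–Thorburn recursion can be checked numerically. A self-contained sketch (function names and the two-letter test word are this note's own, not the authors'):

```python
import math

def u(word, probs, k):
    """u_k = P(word occurs at i+k | word occurs at i), i.i.d. letters."""
    L = len(word)
    if k == 0:
        return 1.0
    if k < L:
        if word[k:] != word[:L - k]:          # no self-overlap at shift k
            return 0.0
        return math.prod(probs[s] for s in word[L - k:])
    return math.prod(probs[s] for s in word)

def waiting_time_pmf(word, probs, n_max):
    """P(T_w = n) via p = sum_{j=L}^{n} P(T_w = j) u_{n-j}, n >= L."""
    L = len(word)
    p = math.prod(probs[s] for s in word)
    pmf = [0.0] * (n_max + 1)                 # P(T_w = n) = 0 for n < L
    for n in range(L, n_max + 1):
        pmf[n] = p - sum(pmf[j] * u(word, probs, n - j)
                         for j in range(L, n))
    return pmf
```

For AA with a fair two-letter alphabet this gives $P(T_w = 2) = 1/4$, $P(T_w = 3) = 1/8$, and $E[T_w] = 6$, the classical mean waiting time for two successive heads.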
The waiting time of a word: The Markov Case

We consider a stationary Markov chain $\{X_n\}_{n \ge 0}$ with state space $S$ and let

$$P^{(m)}(a, b) = P\left(X_{n+m} = b \mid X_n = a\right)$$

designate the probability that the chain should generate $b$, $m$ steps after $a$.
The Markov Case

We denote by $\pi(a)$ the stationary probability of observing $a$, so that

$$\mu(w) = \pi(s_1)\, P(s_1, s_2)\, P(s_2, s_3) \cdots P(s_{L-1}, s_L)$$

is the stationary probability of observing the word $w = s_1 s_2 \cdots s_L$.
The Markov Case: The recursion

The probability function of the waiting time of the word $w = s_1 s_2 \cdots s_L$ at time $n$ in a sample path of $\{X_n\}_{n \ge 0}$ is, for $n > L$,

$$P(T_w = n) = \mu(w) - \sum_{j=L}^{n-1} P(T_w = j)\, u_{n-j},$$

where, for $k \ge L$,

$$u_k = P^{(k-L+1)}(s_L, s_1) \prod_{l=2}^{L} P(s_{l-1}, s_l),$$

$P^{(m)}(s_L, s_1)$ being the probability of transition from $s_L$ to $s_1$ in $m$ steps, and, for $1 \le k \le L-1$,

$$u_k = \varepsilon_k \prod_{l=L-k+1}^{L} P(s_{l-1}, s_l).$$

Furthermore, $P(T_w = n) = 0$ for $n < L$ and $P(T_w = L) = \mu(w)$.
The Markov Case: The recursion

The result can clearly be derived in the same way as the result by Blom and Thorburn was derived above, and is due to
S. Robin and J.J. Daudin (1999): Exact distribution of word occurrences in a random sequence of letters. Journal of Applied Probability, 36, pp. 179–193.
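The Markov-chain recursion can be sketched the same way. The code below is illustrative only: the function names, the power-iteration for the stationary distribution, and the uniform test chain are this note's assumptions, not Robin and Daudin's implementation.

```python
import numpy as np

def markov_waiting_time_pmf(word, states, P, n_max, iters=1000):
    """P(T_w = n) for a word under a stationary Markov chain.

    P is the transition matrix over `states` (assumed ergodic).
    """
    idx = {a: i for i, a in enumerate(states)}
    L = len(word)

    # stationary distribution pi by power iteration
    pi = np.full(len(states), 1.0 / len(states))
    for _ in range(iters):
        pi = pi @ P

    def trans(a, b):
        return P[idx[a], idx[b]]

    mu = pi[idx[word[0]]]                      # mu(w) = pi(s1) prod P(s_{l-1}, s_l)
    for a, b in zip(word, word[1:]):
        mu *= trans(a, b)

    def u(k):
        if k == 0:
            return 1.0
        if k < L:                              # overlap terms
            if word[k:] != word[:L - k]:
                return 0.0
            prod = 1.0
            for a, b in zip(word[L - k - 1:], word[L - k:]):
                prod *= trans(a, b)
            return prod
        # k >= L: reach s1 from s_L in k-L+1 steps, then spell the word
        prod = np.linalg.matrix_power(P, k - L + 1)[idx[word[-1]], idx[word[0]]]
        for a, b in zip(word, word[1:]):
            prod *= trans(a, b)
        return prod

    u_vals = [u(k) for k in range(n_max + 1)]
    pmf = [0.0] * (n_max + 1)                  # P(T_w = n) = 0 for n < L
    for n in range(L, n_max + 1):
        pmf[n] = mu - sum(pmf[j] * u_vals[n - j] for j in range(L, n))
    return pmf
```

With a uniform transition matrix the chain reduces to i.i.d. letters, so the output must agree with the Blom–Thorburn recursion; for AA on a fair two-letter chain, $E[T_w] = 6$.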
The p.g.f. of the waiting time of a word

Gunnar Blom and Daniel Thorburn (1982) also derived the p.g.f. of the waiting time of a word.
The p.g.f. of the waiting time of a word

The recursion

$$P(T_w = n) = p \left( 1 - \sum_{j=L}^{n-L} P(T_w = j) \right) - \sum_{j=n-L+1}^{n-1} P(T_w = j)\, u_{n-j}$$

can be rearranged as

$$p = \sum_{j=L}^{n} P(T_w = j)\, u_{n-j}, \qquad n \ge L.$$

Here

$$u_0 = 1, \qquad u_k = \varepsilon_k \prod_{l=L-k+1}^{L} p_{s_l} \quad (1 \le k \le L-1), \qquad u_k = p \quad (k \ge L),$$

and we use this in the right hand side.
The p.g.f. of the waiting time of a word

$$p = \sum_{j=L}^{n} P(T_w = j)\, u_{n-j}, \qquad n \ge L.$$

Now we multiply this equality by $t^n$ and sum over $n$:

$$\sum_{n=L}^{\infty} p\, t^n = \frac{p\, t^L}{1-t} = \sum_{n=L}^{\infty} t^n \sum_{j=L}^{n} P(T_w = j)\, u_{n-j}.$$
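Spelling out the interchange of summations behind this step (substituting $k = n - j$, using $u_k = p$ for $k \ge L$, and writing $G_{T_w}(t) = \sum_{n \ge L} P(T_w = n)\, t^n$, $A(t) = \sum_{k=0}^{L-1} u_k\, t^k$):

```latex
\sum_{n=L}^{\infty} t^n \sum_{j=L}^{n} P(T_w = j)\, u_{n-j}
  = \sum_{j=L}^{\infty} P(T_w = j)\, t^j \sum_{k=0}^{\infty} u_k\, t^k
  = G_{T_w}(t) \left( A(t) + \sum_{k=L}^{\infty} p\, t^k \right)
  = G_{T_w}(t) \left( A(t) + \frac{p\, t^L}{1-t} \right).
```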
The p.g.f. of the waiting time of a word

$$\frac{p\, t^L}{1-t} = \sum_{n=L}^{\infty} t^n \sum_{j=L}^{n} P(T_w = j)\, u_{n-j}$$

can be rewritten (lots of details omitted) in the form

$$\frac{p\, t^L}{1-t} = G_{T_w}(t) \left( A(t) + \frac{p\, t^L}{1-t} \right),$$

where $G_{T_w}(t) = \sum_{n=L}^{\infty} P(T_w = n)\, t^n$ is the p.g.f. and $A(t) = \sum_{k=0}^{L-1} u_k\, t^k$.
The p.g.f. of the waiting time of a word

$$\frac{p\, t^L}{1-t} = G_{T_w}(t) \left( A(t) + \frac{p\, t^L}{1-t} \right)$$

If we solve this equation w.r.t. $G_{T_w}(t)$ we get

$$G_{T_w}(t) = \frac{p\, t^L}{p\, t^L + (1-t)\, A(t)},$$

where $A(t) = \sum_{k=0}^{L-1} u_k\, t^k$ (= the generating function of the overlap capabilities).
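The closed form is convenient for moments: differentiating at $t = 1$ gives $E[T_w] = G_{T_w}'(1) = A(1)/p$. A short sketch (illustrative names; i.i.d. model as above):

```python
import math

def pgf_waiting_time(word, probs, t):
    """G_{T_w}(t) = p t^L / (p t^L + (1 - t) A(t)), i.i.d. letters."""
    L = len(word)
    p = math.prod(probs[s] for s in word)
    A = 0.0                          # A(t) = sum_{k=0}^{L-1} u_k t^k
    for k in range(L):
        if k == 0:
            u_k = 1.0
        elif word[k:] == word[:L - k]:   # self-overlap at shift k
            u_k = math.prod(probs[s] for s in word[L - k:])
        else:
            u_k = 0.0
        A += u_k * t ** k
    return p * t ** L / (p * t ** L + (1 - t) * A)
```

For AA with fair letters, $A(t) = 1 + t/2$, so $E[T_w] = A(1)/p = (3/2)/(1/4) = 6$, matching the recursion of Blom and Thorburn.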
End of Lecture 6