Introduction to Kneser-Ney Smoothing on Top of Generalized Language Models for Next Word Prediction


DESCRIPTION

The first talk on the topic of my bachelor thesis with a focus on Kneser-Ney smoothing.

TRANSCRIPT

Page 1: Introduction to Kneser-Ney Smoothing on Top of Generalized Language Models for Next Word Prediction

Web Science & Technologies

University of Koblenz ▪ Landau, Germany

Introduction to Kneser-Ney Smoothing on Top of Generalized Language Models for Next Word Prediction

Martin Körner

Oberseminar

25.07.2013

Page 2: Introduction to Kneser-Ney Smoothing on Top of Generalized Language Models for Next Word Prediction


Content

Introduction

Language Models

Generalized Language Models

Smoothing

Progress

Summary

Page 3: Introduction to Kneser-Ney Smoothing on Top of Generalized Language Models for Next Word Prediction


Content

Introduction

Language Models

Generalized Language Models

Smoothing

Progress

Summary

Page 4: Introduction to Kneser-Ney Smoothing on Top of Generalized Language Models for Next Word Prediction


Introduction: Motivation

Next word prediction: What is the next word a user will type?

Use cases for next word prediction:

• Augmentative and Alternative Communication (AAC)

• Small keyboards (smartphones)

Page 5: Introduction to Kneser-Ney Smoothing on Top of Generalized Language Models for Next Word Prediction


Introduction to next word prediction

How do we predict words?

1. Rationalist approach

• Manually encoding information about language

• “Toy” problems only

2. Empiricist approach

• Statistical, pattern recognition, and machine learning methods applied to corpora

• Result: Language models

Page 6: Introduction to Kneser-Ney Smoothing on Top of Generalized Language Models for Next Word Prediction


Content

Introduction

Language Models

Generalized Language Models

Smoothing

Progress

Summary

Page 7: Introduction to Kneser-Ney Smoothing on Top of Generalized Language Models for Next Word Prediction


Language models in general

Language model: How likely is a sentence 𝑠?

Probability distribution: $P(s)$

Calculate $P(s)$ by multiplying conditional probabilities

Example:

$P(\text{If you're going to San Francisco , be sure} \dots) = P(\text{you're} \mid \text{If}) \cdot P(\text{going} \mid \text{If you're}) \cdot P(\text{to} \mid \text{If you're going}) \cdot P(\text{San} \mid \text{If you're going to}) \cdot P(\text{Francisco} \mid \text{If you're going to San}) \cdot \dots$

Empirical approach would fail
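Written generally, the decomposition above is the chain rule of probability (a standard identity, stated here for completeness):

$P(w_1^m) = \prod_{i=1}^{m} P(w_i \mid w_1^{i-1})$

Estimating each of these long-history conditional probabilities directly from corpus counts is what fails in practice, which motivates the Markov assumption on the next slide.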

Page 8: Introduction to Kneser-Ney Smoothing on Top of Generalized Language Models for Next Word Prediction


Conditional probabilities simplified

Markov assumption [JM80]:

Only the last n-1 words are relevant for a prediction

Example with n=5:

$P(\text{sure} \mid \text{If you're going to San Francisco , be}) \approx P(\text{sure} \mid \text{San Francisco , be})$

(The comma counts as a word.)

Page 9: Introduction to Kneser-Ney Smoothing on Top of Generalized Language Models for Next Word Prediction


Definitions and Markov assumption

n-gram: Sequence of length n with a count

E.g. a 5-gram: "If you're going to San" with count 4

Sequence naming: $w_1^{i-1} := w_1 w_2 \dots w_{i-1}$

Markov assumption formalized:

$P(w_i \mid w_1^{i-1}) \approx P(w_i \mid w_{i-n+1}^{i-1})$

(The history $w_{i-n+1}^{i-1}$ consists of the last n-1 words.)

Page 10: Introduction to Kneser-Ney Smoothing on Top of Generalized Language Models for Next Word Prediction


Formalizing next word prediction

Instead of $P(s)$:

Only one conditional probability $P(w_i \mid w_{i-n+1}^{i-1})$

• Simplify $P(w_i \mid w_{i-n+1}^{i-1})$ to $P(w_n \mid w_1^{n-1})$

$\mathrm{NWP}(w_1^{n-1}) = \operatorname*{argmax}_{w_n \in W} P(w_n \mid w_1^{n-1})$

How to calculate the probability $P(w_n \mid w_1^{n-1})$?

(Here $W$ is the set of all words in the corpus, $w_1^{n-1}$ are the preceding n-1 words, and $P(w_n \mid w_1^{n-1})$ is the conditional probability under the Markov assumption.)

Page 11: Introduction to Kneser-Ney Smoothing on Top of Generalized Language Models for Next Word Prediction


How to calculate $P(w_n \mid w_1^{n-1})$

The easiest way:

Maximum likelihood:

$P_{\mathrm{ML}}(w_n \mid w_1^{n-1}) = \dfrac{c(w_1^n)}{c(w_1^{n-1})}$

Example:

$P(\text{San} \mid \text{If you're going to}) = \dfrac{c(\text{If you're going to San})}{c(\text{If you're going to})}$
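As an illustration, a minimal Python sketch of the maximum-likelihood estimate together with the argmax prediction from the previous slide; the toy corpus, tokenization, and function names are mine and not taken from the thesis implementation:

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous subsequences of length n."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Toy corpus; the actual work uses a large text corpus.
corpus = "if you're going to san francisco be sure to wear some flowers".split()

n = 3
ngram_counts = Counter(ngrams(corpus, n))
history_counts = Counter(ngrams(corpus, n - 1))
vocabulary = set(corpus)

def p_ml(word, history):
    """P_ML(w_n | w_1^{n-1}) = c(w_1^n) / c(w_1^{n-1})."""
    if history_counts[history] == 0:
        return 0.0
    return ngram_counts[history + (word,)] / history_counts[history]

def next_word(history):
    """NWP(w_1^{n-1}): the word with the highest estimated probability."""
    return max(vocabulary, key=lambda w: p_ml(w, history))

print(p_ml("san", ("going", "to")))    # 1.0 in this toy corpus
print(next_word(("going", "to")))      # 'san'
```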

Page 12: Introduction to Kneser-Ney Smoothing on Top of Generalized Language Models for Next Word Prediction


Content

Introduction

Language Models

Generalized Language Models

Smoothing

Progress

Summary

Page 13: Introduction to Kneser-Ney Smoothing on Top of Generalized Language Models for Next Word Prediction


Intro Generalized Language Models (GLMs)

Main idea:

Insert wildcard words (∗) into sequences

Example:

Instead of P(San | If you're going to):

• P(San | If ∗ ∗ ∗)

• P(San | If ∗ ∗ to)

• P(San | If ∗ going ∗)

• P(San | If ∗ going to)

• P(San | If you're ∗ ∗)

• …

Separate different types of GLMs based on:

1. Sequence length

2. Number of wildcard words

(e.g. "If ∗ ∗ to San" has length 5 and 2 wildcard words)

Aggregate the results
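A small sketch of how such wildcard histories could be enumerated; keeping the first word fixed and varying the remaining positions is my reading of the examples above, not necessarily how the thesis builds its GLMs:

```python
from itertools import combinations

def wildcard_histories(history):
    """Generate generalized histories by replacing subsets of positions
    (every position except the first, by assumption) with a wildcard '*'."""
    fixed, rest = history[0], history[1:]
    variants = []
    for k in range(len(rest) + 1):
        for positions in combinations(range(len(rest)), k):
            variant = [fixed] + [
                "*" if i in positions else w for i, w in enumerate(rest)
            ]
            variants.append(tuple(variant))
    return variants

for v in wildcard_histories(("If", "you're", "going", "to")):
    print("P(San | " + " ".join(v) + ")")
```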

Page 14: Introduction to Kneser-Ney Smoothing on Top of Generalized Language Models for Next Word Prediction


Why Generalized Language Models?

Data sparsity of n-grams

"If you're going to San" is seen less often than, for example, "If ∗ ∗ to San"

Question: Does that really improve the prediction?

Result of evaluation: Yes

… but we should use smoothing for language models

Page 15: Introduction to Kneser-Ney Smoothing on Top of Generalized Language Models for Next Word Prediction


Content

Introduction

Language Models

Generalized Language Models

Smoothing

Progress

Summary

Page 16: Introduction to Kneser-Ney Smoothing on Top of Generalized Language Models for Next Word Prediction


Smoothing

Problem: Unseen sequences

Try to estimate probabilities of unseen sequences

Probabilities of seen sequences need to be reduced

Two approaches:

1. Backoff smoothing

2. Interpolation smoothing

Page 17: Introduction to Kneser-Ney Smoothing on Top of Generalized Language Models for Next Word Prediction


Backoff smoothing

If sequence unseen: use shorter sequence

E.g.: if $P(\text{San} \mid \text{going to}) = 0$, use $P(\text{San} \mid \text{to})$

$P_{\mathrm{back}}(w_n \mid w_i^{n-1}) =
\begin{cases}
\tau(w_n \mid w_i^{n-1}) & \text{if } c(w_i^n) > 0 \\
\gamma \cdot P_{\mathrm{back}}(w_n \mid w_{i+1}^{n-1}) & \text{if } c(w_i^n) = 0
\end{cases}$

(Here $\tau$ is the higher order probability, $\gamma$ is a weight, and $P_{\mathrm{back}}(w_n \mid w_{i+1}^{n-1})$ is the lower order probability, applied recursively.)
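A schematic Python sketch of this recursion; τ and γ are passed in as placeholder functions, since their concrete form depends on the smoothing method, and the stub values below are purely illustrative:

```python
def p_backoff(word, history, counts, tau, gamma):
    """Backoff smoothing: if c(history + word) > 0, use the higher-order
    probability tau; otherwise back off to the shortened history,
    weighted by gamma so that the probabilities still sum to 1."""
    if not history:
        return tau(word, history)              # lowest order (e.g. unigram)
    if counts.get(history + (word,), 0) > 0:
        return tau(word, history)              # higher-order probability
    return gamma(history) * p_backoff(word, history[1:], counts, tau, gamma)

# Illustrative stubs only: real tau/gamma come from a concrete smoothing method.
counts = {("going", "to", "san"): 4, ("to", "san"): 4, ("san",): 4}
tau = lambda w, h: 0.5
gamma = lambda h: 0.4
print(p_backoff("san", ("going", "to"), counts, tau, gamma))  # seen: 0.5
print(p_backoff("be", ("going", "to"), counts, tau, gamma))   # unseen: backs off
```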

Page 18: Introduction to Kneser-Ney Smoothing on Top of Generalized Language Models for Next Word Prediction


Interpolated Smoothing

Always use shorter sequence for calculation

$P_{\mathrm{inter}}(w_n \mid w_i^{n-1}) = \tau(w_n \mid w_i^{n-1}) + \gamma \cdot P_{\mathrm{inter}}(w_n \mid w_{i+1}^{n-1})$

(Here $\tau$ is the higher order probability, $\gamma$ is a weight, and $P_{\mathrm{inter}}(w_n \mid w_{i+1}^{n-1})$ is the lower order probability, applied recursively.)

Seems to work better than backoff smoothing
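The corresponding sketch for interpolation, again with τ and γ left abstract; note that the lower order term is always added, not only for unseen sequences:

```python
def p_interpolated(word, history, tau, gamma):
    """Interpolated smoothing: mix the higher-order estimate with the
    recursively smoothed lower-order estimate at every level."""
    if not history:
        return tau(word, history)      # lowest order ends the recursion
    return tau(word, history) + gamma(history) * p_interpolated(
        word, history[1:], tau, gamma)
```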

Page 19: Introduction to Kneser-Ney Smoothing on Top of Generalized Language Models for Next Word Prediction


Kneser-Ney smoothing [KN95] intro

Interpolated smoothing

Idea: Improve lower order calculation

Example: the word "visiting" is unseen in the corpus

$P(\text{Francisco} \mid \text{visiting}) = 0$

Normal interpolation: $0 + \gamma \cdot P(\text{Francisco})$

$P(\text{San} \mid \text{visiting}) = 0$

Normal interpolation: $0 + \gamma \cdot P(\text{San})$

Result: Francisco is as likely as San at that position

Is that correct?

Difference between Francisco and San?

Answer: Number of different contexts

Page 20: Introduction to Kneser-Ney Smoothing on Top of Generalized Language Models for Next Word Prediction


Kneser-Ney smoothing idea

For the lower order calculation:

Don't use $c(w_n)$. Instead: the number of different bigrams the word completes:

$N_{1+}(\bullet\, w_n) := |\{ w_{n-1} : c(w_{n-1}^n) > 0 \}|$

Or in general:

$N_{1+}(\bullet\, w_{i+1}^n) := |\{ w_i : c(w_i^n) > 0 \}|$

In addition:

$N_{1+}(\bullet\, w_{i+1}^{n-1}\, \bullet) := \sum_{w_n} N_{1+}(\bullet\, w_{i+1}^n)$

$N_{1+}(w_i^{n-1}\, \bullet) := |\{ w_n : c(w_i^n) > 0 \}|$

(Here $c(\cdot)$ denotes the count of a sequence.)
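A minimal sketch of the bigram case $N_{1+}(\bullet\, w)$; the toy counts are invented to mirror the San/Francisco example:

```python
from collections import Counter

def continuation_counts(bigram_counts):
    """N_{1+}(• w): for each word w, the number of distinct words that
    precede it in at least one observed bigram."""
    preceding = {}
    for (w_prev, w), c in bigram_counts.items():
        if c > 0:
            preceding.setdefault(w, set()).add(w_prev)
    return {w: len(prevs) for w, prevs in preceding.items()}

# "Francisco" follows almost only "San"; "San" follows many different words.
bigram_counts = Counter({
    ("San", "Francisco"): 10,
    ("visit", "San"): 3, ("to", "San"): 5, ("in", "San"): 2,
})
print(continuation_counts(bigram_counts))
# {'Francisco': 1, 'San': 3}  -> San has more distinct left contexts
```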

Page 21: Introduction to Kneser-Ney Smoothing on Top of Generalized Language Models for Next Word Prediction


Kneser-Ney smoothing equation (highest)

Highest order calculation:

$P_{\mathrm{KN}}(w_n \mid w_i^{n-1}) = \dfrac{\max\{c(w_i^n) - D, 0\}}{c(w_i^{n-1})} + \dfrac{D}{c(w_i^{n-1})}\, N_{1+}(w_i^{n-1}\, \bullet)\, P_{\mathrm{KN}}(w_n \mid w_{i+1}^{n-1})$

(Here $c(w_i^n)$ is the count of the full sequence, $c(w_i^{n-1})$ is the total count of the history, the $\max\{\cdot, 0\}$ assures a positive value, $D$ is the discount value with $0 \le D \le 1$, $\frac{D}{c(w_i^{n-1})} N_{1+}(w_i^{n-1}\, \bullet)$ is the lower order weight, and $P_{\mathrm{KN}}(w_n \mid w_{i+1}^{n-1})$ is the lower order probability, applied recursively.)

Page 22: Introduction to Kneser-Ney Smoothing on Top of Generalized Language Models for Next Word Prediction


Kneser-Ney smoothing equation

Lower order calculation:

$P_{\mathrm{KN}}(w_n \mid w_i^{n-1}) = \dfrac{\max\{N_{1+}(\bullet\, w_i^n) - D, 0\}}{N_{1+}(\bullet\, w_i^{n-1}\, \bullet)} + \dfrac{D}{N_{1+}(\bullet\, w_i^{n-1}\, \bullet)}\, N_{1+}(w_i^{n-1}\, \bullet)\, P_{\mathrm{KN}}(w_n \mid w_{i+1}^{n-1})$

Lowest order calculation:

$P_{\mathrm{KN}}(w_n) = \dfrac{N_{1+}(\bullet\, w_n)}{N_{1+}(\bullet\, \bullet)}$

(Here $N_{1+}(\bullet\, w_i^n)$ is the continuation count, $N_{1+}(\bullet\, w_i^{n-1}\, \bullet)$ is the total continuation count, the $\max\{\cdot, 0\}$ assures a positive value, $D$ is the discount value, $\frac{D}{N_{1+}(\bullet\, w_i^{n-1}\, \bullet)} N_{1+}(w_i^{n-1}\, \bullet)$ is the lower order weight, and $P_{\mathrm{KN}}(w_n \mid w_{i+1}^{n-1})$ is the lower order probability, applied recursively.)
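A compact two-level (bigram/unigram) sketch of these equations with a fixed discount D; a full implementation would apply the lower-order equation at every intermediate order, and the function names and toy counts are mine:

```python
from collections import Counter

def kneser_ney_bigram(bigram_counts, D=0.75):
    """Two-level Kneser-Ney sketch: the highest-order equation for
    P_KN(w | v) plus the continuation-count estimate for P_KN(w)."""
    history_counts = Counter()   # c(v), approximated from bigram counts
    followers = {}               # N_{1+}(v •): distinct words following v
    preceders = {}               # N_{1+}(• w): distinct words preceding w
    for (v, w), c in bigram_counts.items():
        if c > 0:
            history_counts[v] += c
            followers.setdefault(v, set()).add(w)
            preceders.setdefault(w, set()).add(v)
    n_bigram_types = sum(len(s) for s in preceders.values())  # N_{1+}(• •)

    def p_lowest(w):
        # P_KN(w) = N_{1+}(• w) / N_{1+}(• •)
        return len(preceders.get(w, ())) / n_bigram_types

    def p_kn(w, v):
        c_v = history_counts[v]
        if c_v == 0:
            # Unseen history: only the lower order term remains
            # (the weight is set to 1 here for simplicity).
            return p_lowest(w)
        discounted = max(bigram_counts.get((v, w), 0) - D, 0) / c_v
        gamma = (D / c_v) * len(followers.get(v, ()))  # lower order weight
        return discounted + gamma * p_lowest(w)

    return p_kn

p = kneser_ney_bigram({("San", "Francisco"): 10, ("going", "to"): 6,
                       ("to", "San"): 5, ("visit", "San"): 2})
print(p("Francisco", "visiting"))  # 0.25: only one left context
print(p("San", "visiting"))        # 0.50: San completes more bigram types
```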

Page 23: Introduction to Kneser-Ney Smoothing on Top of Generalized Language Models for Next Word Prediction


Modified Kneser-Ney smoothing [CG98]

Different discount values for different absolute counts

Lower order calculation:

$P_{\mathrm{KN}}(w_n \mid w_i^{n-1}) = \dfrac{\max\{N_{1+}(\bullet\, w_i^n) - D(c(w_i^n)), 0\}}{N_{1+}(\bullet\, w_i^{n-1}\, \bullet)} + \dfrac{D_1 N_1(w_i^{n-1}\, \bullet) + D_2 N_2(w_i^{n-1}\, \bullet) + D_{3+} N_{3+}(w_i^{n-1}\, \bullet)}{N_{1+}(\bullet\, w_i^{n-1}\, \bullet)}\, P_{\mathrm{KN}}(w_n \mid w_{i+1}^{n-1})$

State of the art (for 15 years!)
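For completeness (not on the slide): [CG98] estimate the three discounts from the counts-of-counts of the training data, where $n_k$ is the number of n-grams occurring exactly $k$ times:

$Y = \dfrac{n_1}{n_1 + 2 n_2}, \qquad D_1 = 1 - 2Y\dfrac{n_2}{n_1}, \qquad D_2 = 2 - 3Y\dfrac{n_3}{n_2}, \qquad D_{3+} = 3 - 4Y\dfrac{n_4}{n_3}$

with $D(0) = 0$, $D(1) = D_1$, $D(2) = D_2$, and $D(c) = D_{3+}$ for $c \ge 3$.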

Page 24: Introduction to Kneser-Ney Smoothing on Top of Generalized Language Models for Next Word Prediction


Smoothing of GLMs

We can use all smoothing techniques on GLMs as well!

Small modification:

E.g.: P(San | If ∗ going ∗)

Lower order sequence:

– Normally: P(San | ∗ going ∗)

– Instead use P(San | going ∗)
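A tiny helper sketching this modification (the function name is mine): drop the leftmost word as usual, then also strip any wildcard that would now lead the sequence:

```python
def glm_lower_order_history(history):
    """Drop the leftmost word as usual, then also strip any leading
    wildcards, following the modification described on this slide."""
    shorter = list(history[1:])
    while shorter and shorter[0] == "*":
        shorter.pop(0)
    return tuple(shorter)

print(glm_lower_order_history(("If", "*", "going", "*")))  # ('going', '*')
```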

Page 25: Introduction to Kneser-Ney Smoothing on Top of Generalized Language Models for Next Word Prediction


Content

Introduction

Language Models

Generalized Language Models

Smoothing

Progress

Summary

Page 26: Introduction to Kneser-Ney Smoothing on Top of Generalized Language Models for Next Word Prediction


Progress

Done so far:

Extract text from XML files

Build GLMs

Kneser-Ney and modified Kneser-Ney smoothing

Indexing with MySQL

To do:

Finish evaluation program

Run evaluation

Analyze results

Page 27: Introduction to Kneser-Ney Smoothing on Top of Generalized Language Models for Next Word Prediction


Content

Introduction

Language Models

Generalized Language Models

Smoothing

Progress

Summary

Page 28: Introduction to Kneser-Ney Smoothing on Top of Generalized Language Models for Next Word Prediction


Summary

Data Sets: • More Data • Better Data

Language Models: • n-grams • Generalized Language Models

Smoothing: • Katz • Good-Turing • Witten-Bell • Kneser-Ney • …

Page 29: Introduction to Kneser-Ney Smoothing on Top of Generalized Language Models for Next Word Prediction


Thank you for your attention!

Questions?

Page 30: Introduction to Kneser-Ney Smoothing on Top of Generalized Language Models for Next Word Prediction


Sources

Images:

Wheelchair Joystick (Slide 4): http://i01.i.aliimg.com/img/pb/741/422/527/527422741_355.jpg

Smartphone Keyboard (Slide 4): https://activecaptain.com/articles/mobilePhones/iPhone/iPhone_Keyboard.jpg

References:

[CG98]: Stanley Chen and Joshua Goodman. An empirical study of smoothing techniques for language modeling. Technical Report TR-10-98, Harvard University, August 1998.

[JM80]: F. Jelinek and R. L. Mercer. Interpolated estimation of Markov source parameters from sparse data. In Proceedings of the Workshop on Pattern Recognition in Practice, pages 381–397, 1980.

[KN95]: Reinhard Kneser and Hermann Ney. Improved backing-off for m-gram language modeling. In Acoustics, Speech, and Signal Processing (ICASSP-95), 1995 International Conference on, volume 1, pages 181–184. IEEE, 1995.