TRANSCRIPT
An Efficient Language Model
Using Double-Array Structures
Makoto Yasuhara, Toru Tanaka
Jun-ya Norimatsu, Mikio Yamamoto
University of Tsukuba, Japan
EMNLP 2013
Introduction (1)
Bigger and Bigger LMs
Have you ever encountered these problems?
LMs cannot be loaded into memory because of their size
The query speed of LMs becomes a bottleneck in your system
Store compactly, query fast!
Our System Overview
We call our LM “DALM”
• LM implementation based on double-array structures
• Modified double-array structure to store backward suffix trees
• Two optimization methods to improve efficiency
Double-Array Structures
(Aoe, 1989)
A fast and compact representation of a trie
What is a double-array structure?
[Figure: an abstract trie with edges A and B, and its double-array representation as BASE and CHECK arrays]
A trie is represented by two arrays (BASE and CHECK)
2D Array Implementation of a Trie
[Figure: a seven-node trie stored as a 2D array indexed by node number (rows 1-7) and word (columns A, B, C); most cells are empty]
Simple and fast, but consumes a lot of memory: the array is sparse
Compact Representation of a
Sparse 2D Array
[Figure: rows of the sparse 2D array are shifted by per-row offsets (shift 3, shift 3, shift 4) and merged into a single NEXT array]
Merging alone loses information: the merged array no longer records which node each entry belongs to
The double-array structure is a modified form that keeps all information about the original trie
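The shift-and-merge step can be sketched as follows. The row format (a dict from column to value) and the greedy offset search are illustrative assumptions; this shows only the merge, before the second array repairs the information loss.

```python
# A sketch of merging shifted sparse rows into one shared array.
# Row format (dict: column -> value) and greedy search are assumptions.

def merge_rows(rows):
    """Shift each sparse row until its occupied columns fit into free
    slots of a shared array; return the per-row shifts and the array."""
    merged = {}            # column -> value, for all rows together
    shifts = []
    for row in rows:
        shift = 0
        while any(shift + col in merged for col in row):
            shift += 1     # collision: try the next offset
        for col, val in row.items():
            merged[shift + col] = val
        shifts.append(shift)
    return shifts, merged
```

For example, merge_rows([{1: "a", 2: "b"}, {1: "c"}, {2: "d"}]) packs all three rows into one array with shifts [0, 2, 2]; the merged array alone cannot recover which row an entry came from, which is exactly the information loss the modified double-array structure repairs.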
Details of Double-Array Structures
(Aoe, 1989)
Definition: an edge labeled c from node s to node t exists iff
BASE[s] + code(c) = t and CHECK[t] = s
[Figure: example BASE and CHECK arrays (indices 0-7) for the trie with edges A, B, C]
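The BASE/CHECK traversal rule can be sketched as a small lookup function; the toy arrays below are illustrative, not the slide's exact example.

```python
# A minimal sketch of double-array trie traversal (Aoe, 1989).
# The toy BASE/CHECK arrays below are illustrative, not the slide's example.

def traverse(base, check, node, symbol_code):
    """Follow the edge labeled `symbol_code` out of `node`.
    Returns the child node, or None if the edge does not exist."""
    child = base[node] + symbol_code
    if 0 <= child < len(check) and check[child] == node:
        return child
    return None

# Toy trie: root 0 with children A (code 1) -> node 1 and B (code 2) -> node 2.
base = [0, 0, 0]
check = [-1, 0, 0]   # check[t] holds the parent of node t
```

Here traverse(base, check, 0, 1) returns node 1, while an absent edge such as code 3 returns None, so both membership and traversal cost one array probe per symbol.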
Efficient Trie Representations for
N-gram Models
(Bell et al., 1990; Stolcke, 2002; Germann et al., 2009)
Backward suffix trees
[Figure: a trie storing histories in reverse order, with separate target-word lists X, Y, Z attached to the history nodes; a callout marks a missing B node]
History words are stored in reverse order
Target words are stored in separate lists
Efficient back-off: when the full reversed history is not found (the B node is missing), the node where traversal stops already corresponds to the back-off context
Endmarker Symbols for
Backward Suffix Trees
[Figure: the backward suffix tree redrawn with an endmarker symbol (#) inserted after each reversed history; the target words X, Y, Z follow the endmarker]
Endmarker symbols (Aoe, 1989) are placed after the history words
The target word follows the endmarker symbol
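The key layout above can be sketched as a small construction function; the function name and the use of string symbols (including "#" for the endmarker) are illustrative.

```python
# A sketch of laying out an n-gram as a backward-suffix-tree key:
# reversed history, then the endmarker, then the target word.
# Symbol names and the "#" endmarker are illustrative.

ENDMARKER = "#"

def ngram_key(ngram):
    """(history..., target) -> reversed history + endmarker + target."""
    *history, target = ngram
    return list(reversed(history)) + [ENDMARKER, target]
```

For example, the 4-gram (A, B, C, X) becomes the key C B A # X, so all n-grams sharing a history suffix share a trie prefix.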
Double-array Representation of
Backward Suffix Trees
Endmarker symbols are treated as words
A word ID is assigned to the endmarker symbol
[Figure: double-array representation (BASE and CHECK) of the backward suffix tree, with the endmarker's word ID stored like any other word's]
Double-array Language Model:
Simple Structures
Introducing a VALUE array
[Figure: the double array (BASE, CHECK) extended with a parallel VALUE array, shown for a small trie over the words A, B, X and the endmarker #]
The VALUE array contains the corresponding
probabilities and back-off weights (BOW)
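A query against the simple structure can be sketched as follows; the arrays, word codes, and values below are toy data, not DALM's real layout.

```python
# A sketch of querying the simple structure: walk the double array along
# the key (reversed history, endmarker, target), then read the node's
# probability and back-off weight from a parallel VALUE array.
# All arrays and codes below are toy data, not DALM's real layout.

codes = {"A": 1, "#": 2, "X": 3}
base  = [0, 1, 0, 1, 0]
check = [-1, 0, -1, 1, 3]   # check[t] holds the parent of node t
# value[t] = (log probability, back-off weight) for node t
value = [None, (0.0, 0.0), None, (0.0, -0.3), (-1.2, 0.0)]

def lookup(key):
    """Walk the trie along `key`; return (prob, bow), or None if absent."""
    node = 0
    for sym in key:
        child = base[node] + codes[sym]
        if child >= len(check) or check[child] != node:
            return None
        node = child
    return value[node]
```

With this toy data, lookup(["A", "#", "X"]) reaches node 4 and returns its stored (probability, BOW) pair, while a key not in the trie returns None.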
Double-array Language Model:
Embedding structures (1)
Filling unused slots with values
[Figure: the same double array; slots used by no trie node are highlighted as unused]
These empty slots are used to store values
Double-array Language Model:
Embedding structures (2)
Using the BASE and CHECK arrays to store values
[Figure: the double array with values stored in its unused BASE/CHECK slots; one CHECK entry is -2]
Lossless quantization
A negative entry (e.g., -2) is an index into the VALUE array, stored with a negative sign to distinguish it from a node index
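The sign trick can be sketched as a slot reader; the exact encoding (simple negation) is an illustrative assumption, not necessarily DALM's bit layout.

```python
# A sketch of the embedding trick: a nonnegative CHECK entry is a parent
# node index, while a negative entry indexes the VALUE array.
# The encoding (simple negation) is an illustrative assumption.

def read_slot(check, value, i):
    """Interpret slot i of CHECK: a parent node, or an embedded value index."""
    if check[i] < 0:
        return ("value", value[-check[i]])
    return ("parent", check[i])
```

Because node indices are always nonnegative, the sign bit is enough to tell the two interpretations apart without any extra storage.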
Double-array Language Model:
Ordering method (1)
Tuning word IDs
We assign word IDs in order of unigram probability
P(Word)   Word   Word ID
   -       #        1
 0.0413    B        2
 0.0300    X        3
 0.0284    A        4
 0.0201    Y        5
 0.0101    C        6
 0.0050    Z        7
 0.0020    D        8
Sort the words in descending order of probability
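The assignment above can be sketched directly from the unigram probabilities; pinning the endmarker to ID 1 follows the slide's table, and the data below is its toy example.

```python
# A sketch of the ordering method: frequent words get small word IDs.
# The endmarker is pinned to ID 1, as in the slide's table.

def assign_ids(unigram_probs, endmarker="#"):
    """Map each word to an ID in descending order of unigram probability."""
    ranked = sorted(unigram_probs, key=unigram_probs.get, reverse=True)
    return {w: i + 1 for i, w in enumerate([endmarker] + ranked)}

probs = {"B": 0.0413, "X": 0.0300, "A": 0.0284, "Y": 0.0201,
         "C": 0.0101, "Z": 0.0050, "D": 0.0020}
```

Calling assign_ids(probs) reproduces the table: # gets 1, B gets 2, and the rarest word D gets 8.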
Double-array Language Model:
Ordering method (2)
[Figure: the toy 2D array before and after ordering; reassigning word IDs packs the occupied entries into the low-ID columns]
Before ordering: occupied entries are scattered across the columns
After ordering: occupied entries cluster at small word IDs, so the rows merge more compactly
Ordering modifies the 2D array
Experiments: Datasets
Model        Corpus size   Unique types   N-grams
             [words]       [words]        (unigrams to 5-grams)
100 Mwords   100 M         195 K          31 M
5 Gwords     5 G           2,140 K        936 M
Test set     100 M         198 K          -
Data source: publications of unexamined Japanese patent applications,
distributed with the NTCIR 3, 4, 5, and 6 patent retrieval tasks
(Iwayama et al., 2003; Fujii et al., 2004; 2005; 2007)
Comparison: Proposed Methods
Results for 100-Mword corpus
Building a large double-array structure takes a long time;
it is impractical to wait for the 5-Gword model to be built
Division method: divide the trie into several parts and build a double array for each
(Nakamura and Mochizuki, 2006)
[Figure: the trie split into several subtries, each with its own root]
Experiments: Division Methods
Results for 100-Mword corpus
Experiments: Other Methods
Results for 100-Mword and 5-Gword corpora
Discussion
DALM is smaller and faster than KenLM Probing
The smallest LM is KenLM Trie
The differences between KenLM Probing and DALM are
smaller for the 5-Gword model than for the 100-Mword model:
larger language models spend less time on back-off
Conclusion
We proposed an efficient language model using double-array structures
We proposed two optimization methods: embedding and ordering
In experiments, DALM achieved the best speed among the compared
LMs while keeping a modest model size
• Double-array structures are a fast and compact representation of tries
• We use double-array structures to represent backward suffix trees
• Embedding: using empty slots in the double-array to store values
• Ordering: tuning word IDs to make LMs smaller and faster
Questions…
My English skills are limited
Please speak slowly if you have any questions.