TRANSCRIPT
An Efficient Language Model
Using Double-Array Structures
Makoto Yasuhara, Toru Tanaka
Jun-ya Norimatsu, Mikio Yamamoto
University of Tsukuba, Japan
EMNLP 2013
Introduction (1)
Bigger and Bigger LMs
Have you ever encountered these problems?
LMs cannot be loaded into memory because of their size
The query speed of LMs becomes a bottleneck in your system
Store compactly, query fast!
Our System Overview
We call our LM “DALM”
• LM implementation based on double-array structures
• Modified double-array structure to store backward suffix trees
• Two optimization methods to improve efficiency
Double-Array Structures
(Aoe, 1989)
A fast and compact representation of a trie
What is a double-array structure?
[Figure: an abstract trie with edges A and B, and its double-array representation as BASE and CHECK arrays]
A trie is represented by two arrays (BASE and CHECK)
2D Array Implementation of a Trie
[Figure: a seven-node trie stored as a 2D array indexed by node number (rows 1-7) and word (columns A, B, C); most cells are empty]
Simple and fast, but consumes a lot of memory: the array is sparse
Compact Representation of a
Sparse 2D Array
[Figure: rows of the sparse 2D array are shifted by per-row offsets (shift 3, shift 3, shift 4) and merged into a single NEXT array]
Merging alone loses information: the merged array no longer records which node each entry belongs to
The double-array structure is a modified form that keeps all information about the original trie
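The shift-and-merge step can be sketched as follows. The row format (a dict from column to value) and the greedy offset search are illustrative assumptions; this shows only the merge, before the second array repairs the information loss.

```python
# A sketch of merging shifted sparse rows into one shared array.
# Row format (dict: column -> value) and greedy search are assumptions.

def merge_rows(rows):
    """Shift each sparse row until its occupied columns fit into free
    slots of a shared array; return the per-row shifts and the array."""
    merged = {}            # column -> value, for all rows together
    shifts = []
    for row in rows:
        shift = 0
        while any(shift + col in merged for col in row):
            shift += 1     # collision: try the next offset
        for col, val in row.items():
            merged[shift + col] = val
        shifts.append(shift)
    return shifts, merged
```

For example, merge_rows([{1: "a", 2: "b"}, {1: "c"}, {2: "d"}]) packs all three rows into one array with shifts [0, 2, 2]; the merged array alone cannot recover which row an entry came from, which is exactly the information loss the modified double-array structure repairs.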
Details of Double-Array Structures
(Aoe, 1989)
Definition: an edge labeled c from node s to node t exists iff
BASE[s] + code(c) = t and CHECK[t] = s
[Figure: example BASE and CHECK arrays (indices 0-7) for the trie with edges A, B, C]
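The BASE/CHECK traversal rule can be sketched as a small lookup function; the toy arrays below are illustrative, not the slide's exact example.

```python
# A minimal sketch of double-array trie traversal (Aoe, 1989).
# The toy BASE/CHECK arrays below are illustrative, not the slide's example.

def traverse(base, check, node, symbol_code):
    """Follow the edge labeled `symbol_code` out of `node`.
    Returns the child node, or None if the edge does not exist."""
    child = base[node] + symbol_code
    if 0 <= child < len(check) and check[child] == node:
        return child
    return None

# Toy trie: root 0 with children A (code 1) -> node 1 and B (code 2) -> node 2.
base = [0, 0, 0]
check = [-1, 0, 0]   # check[t] holds the parent of node t
```

Here traverse(base, check, 0, 1) returns node 1, while an absent edge such as code 3 returns None, so both membership and traversal cost one array probe per symbol.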
Efficient Trie Representations for
N-gram Models
(Bell et al., 1990; Stolcke, 2002; Germann et al., 2009)
Backward suffix trees
[Figure: a trie storing histories in reverse order, with separate target-word lists X, Y, Z attached to the history nodes; a callout marks a missing B node]
History words are stored in reverse order
Target words are stored in separate lists
Efficient back-off: when the full reversed history is not found (the B node is missing), the node where traversal stops already corresponds to the back-off context
Endmarker Symbols for
Backward Suffix Trees
[Figure: the backward suffix tree redrawn with an endmarker symbol (#) inserted after each reversed history; the target words X, Y, Z follow the endmarker]
Endmarker symbols (Aoe, 1989) are placed after the history words
The target word follows the endmarker symbol
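The key layout above can be sketched as a small construction function; the function name and the use of string symbols (including "#" for the endmarker) are illustrative.

```python
# A sketch of laying out an n-gram as a backward-suffix-tree key:
# reversed history, then the endmarker, then the target word.
# Symbol names and the "#" endmarker are illustrative.

ENDMARKER = "#"

def ngram_key(ngram):
    """(history..., target) -> reversed history + endmarker + target."""
    *history, target = ngram
    return list(reversed(history)) + [ENDMARKER, target]
```

For example, the 4-gram (A, B, C, X) becomes the key C B A # X, so all n-grams sharing a history suffix share a trie prefix.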
Double-array Representation of
Backward Suffix Trees
Endmarker symbols are treated as words
A word ID is assigned to the endmarker symbol
[Figure: double-array representation (BASE and CHECK) of the backward suffix tree, with the endmarker's word ID stored like any other word's]
Double-array Language Model:
Simple Structures
Introducing a VALUE array
[Figure: the double array (BASE, CHECK) extended with a parallel VALUE array, shown for a small trie over the words A, B, X and the endmarker #]
The VALUE array contains the corresponding
probabilities and back-off weights (BOW)
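A query against the simple structure can be sketched as follows; the arrays, word codes, and values below are toy data, not DALM's real layout.

```python
# A sketch of querying the simple structure: walk the double array along
# the key (reversed history, endmarker, target), then read the node's
# probability and back-off weight from a parallel VALUE array.
# All arrays and codes below are toy data, not DALM's real layout.

codes = {"A": 1, "#": 2, "X": 3}
base  = [0, 1, 0, 1, 0]
check = [-1, 0, -1, 1, 3]   # check[t] holds the parent of node t
# value[t] = (log probability, back-off weight) for node t
value = [None, (0.0, 0.0), None, (0.0, -0.3), (-1.2, 0.0)]

def lookup(key):
    """Walk the trie along `key`; return (prob, bow), or None if absent."""
    node = 0
    for sym in key:
        child = base[node] + codes[sym]
        if child >= len(check) or check[child] != node:
            return None
        node = child
    return value[node]
```

With this toy data, lookup(["A", "#", "X"]) reaches node 4 and returns its stored (probability, BOW) pair, while a key not in the trie returns None.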
Double-array Language Model:
Embedding structures (1)
Filling unused slots with values
[Figure: the same double array; slots used by no trie node are highlighted as unused]
These empty slots are used to store values
Double-array Language Model:
Embedding structures (2)
Using the BASE and CHECK arrays to store values
[Figure: the double array with values stored in its unused BASE/CHECK slots; one CHECK entry is -2]
Lossless quantization
A negative entry (e.g., -2) is an index into the VALUE array, stored with a negative sign to distinguish it from a node index
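The sign trick can be sketched as a slot reader; the exact encoding (simple negation) is an illustrative assumption, not necessarily DALM's bit layout.

```python
# A sketch of the embedding trick: a nonnegative CHECK entry is a parent
# node index, while a negative entry indexes the VALUE array.
# The encoding (simple negation) is an illustrative assumption.

def read_slot(check, value, i):
    """Interpret slot i of CHECK: a parent node, or an embedded value index."""
    if check[i] < 0:
        return ("value", value[-check[i]])
    return ("parent", check[i])
```

Because node indices are always nonnegative, the sign bit is enough to tell the two interpretations apart without any extra storage.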
Double-array Language Model:
Ordering method (1)
Tuning word IDs
We assign word IDs in order of unigram probability
P(Word)   Word   Word ID
   -       #        1
 0.0413    B        2
 0.0300    X        3
 0.0284    A        4
 0.0201    Y        5
 0.0101    C        6
 0.0050    Z        7
 0.0020    D        8
Sort the words in descending order of probability
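The assignment above can be sketched directly from the unigram probabilities; pinning the endmarker to ID 1 follows the slide's table, and the data below is its toy example.

```python
# A sketch of the ordering method: frequent words get small word IDs.
# The endmarker is pinned to ID 1, as in the slide's table.

def assign_ids(unigram_probs, endmarker="#"):
    """Map each word to an ID in descending order of unigram probability."""
    ranked = sorted(unigram_probs, key=unigram_probs.get, reverse=True)
    return {w: i + 1 for i, w in enumerate([endmarker] + ranked)}

probs = {"B": 0.0413, "X": 0.0300, "A": 0.0284, "Y": 0.0201,
         "C": 0.0101, "Z": 0.0050, "D": 0.0020}
```

Calling assign_ids(probs) reproduces the table: # gets 1, B gets 2, and the rarest word D gets 8.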
Double-array Language Model:
Ordering method (2)
[Figure: the toy 2D array before and after ordering; reassigning word IDs packs the occupied entries into the low-ID columns]
Before ordering: occupied entries are scattered across the columns
After ordering: occupied entries cluster at small word IDs, so the rows merge more compactly
Ordering modifies the 2D array
Experiments: Datasets
Model        Corpus size   Unique types   N-grams
             [words]       [words]        (unigrams to 5-grams)
100 Mwords   100 M         195 K          31 M
5 Gwords     5 G           2,140 K        936 M
Test set     100 M         198 K          -
Data source: publications of unexamined Japanese patent applications,
distributed with the NTCIR 3, 4, 5, and 6 patent retrieval tasks
(Iwayama et al., 2003; Fujii et al., 2004; 2005; 2007)
Comparison: Proposed Methods
Results for 100-Mword corpus
Building a large double-array structure takes a long time;
it is impractical to wait for the 5-Gword model to be built
Division method: divide the trie into several parts and build a double array for each
(Nakamura and Mochizuki, 2006)
[Figure: the trie split into several subtries, each with its own root]
Experiments: Division Methods
Results for 100-Mword corpus
Experiments: Other Methods
Results for 100-Mword and 5-Gword corpora
Discussion
DALM is smaller and faster than KenLM Probing
The smallest LM is KenLM Trie
The differences between KenLM Probing and DALM are
smaller for the 5-Gword model than for the 100-Mword model:
larger language models spend less time on back-off
Conclusion
We proposed an efficient language model using double-array structures
We proposed two optimization methods: embedding and ordering
In experiments, DALM achieved the best speed among the compared
LMs while keeping a modest model size
• Double-array structures are a fast and compact representation of tries
• We use double-array structures to represent backward suffix trees
• Embedding: using empty slots in the double-array to store values
• Ordering: tuning word IDs to make LMs smaller and faster
Questions…
My English skills are limited
Please speak slowly if you have any questions.