an efficient language model using double-array structures

An Efficient Language Model Using Double-Array Structures. Makoto Yasuhara, Toru Tanaka, Jun-ya Norimatsu, Mikio Yamamoto. University of Tsukuba, Japan. EMNLP 2013.

Upload: jun-ya-norimatsu

Post on 22-Jul-2015

TRANSCRIPT

Page 1: An Efficient Language Model Using Double-Array Structures

An Efficient Language Model

Using Double-Array Structures

Makoto Yasuhara, Toru Tanaka

Jun-ya Norimatsu, Mikio Yamamoto

University of Tsukuba, Japan

EMNLP 2013

Page 2: An Efficient Language Model Using Double-Array Structures

Introduction(1)

Bigger and Bigger LMs

Have you ever encountered these problems?

LMs cannot be loaded into memory because of their size

The query speed of LMs becomes a bottleneck in your system

Store compactly, query fast!

Page 3: An Efficient Language Model Using Double-Array Structures

Our System Overview

We call our LM “DALM”

• LM implementation based on double-array structures

• Modified double-array structure to store backward suffix trees

• Two optimization methods to improve efficiency

Page 4: An Efficient Language Model Using Double-Array Structures

Double-Array Structures

(Aoe, 1989)

A fast and compact representation of a trie

What is a double-array structure?

A trie is represented by two arrays (BASE and CHECK)

[Figure: abstract image of a small trie (ROOT with edges A and B) next to its double-array representation, the BASE and CHECK arrays]

Page 5: An Efficient Language Model Using Double-Array Structures

2D Array Implementation of a Trie

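[Figure: a trie under ROOT with nodes numbered 1 to 7 and edges labeled A, B, and C, next to its 2D transition table with one row per Node# (1 to 7) and one column per symbol (A, B, C)]

Simple and fast, but consumes a lot of memory

Sparse array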
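A minimal Python sketch of the 2D-array trie idea, to make the memory cost concrete. The key set and the helper names (`build_table`, `walk`) are made up for illustration; this is not the trie drawn on the slide.

```python
SYMBOLS = ["A", "B", "C"]                      # illustrative alphabet
COL = {s: i for i, s in enumerate(SYMBOLS)}    # symbol -> column number

def build_table(keys):
    """Build table[node][column] = child node, with 0 meaning "no edge".

    Row 0 is unused so that 0 can serve as the empty marker; row 1 is ROOT.
    """
    table = [[0] * len(SYMBOLS), [0] * len(SYMBOLS)]
    for key in keys:
        node = 1
        for sym in key:
            c = COL[sym]
            if table[node][c] == 0:
                table.append([0] * len(SYMBOLS))   # allocate a new node row
                table[node][c] = len(table) - 1
            node = table[node][c]
    return table

def walk(table, key):
    """Follow key from ROOT; return the node reached, or 0 if an edge is missing."""
    node = 1
    for sym in key:
        node = table[node][COL[sym]]
        if node == 0:
            return 0
    return node

table = build_table(["BA", "BC", "CC"])
used = sum(1 for row in table[1:] for cell in row if cell)
total = sum(len(row) for row in table[1:])
print(f"{used} of {total} cells are used")
```

Each transition is a single array access, but most cells stay empty. With a real vocabulary (hundreds of thousands of columns) the table would be far sparser than in this toy case.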

Page 6: An Efficient Language Model Using Double-Array Structures

Compact Representation of a

Sparse 2D Array

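[Figure: each row of the sparse 2D array (Node# 1 to 7, columns A, B, C) is shifted by its own offset (shifts of 3, 3, and 4 are shown), and the shifted rows are merged into a single array, Merged-NEXT: 2 3 4 6 5 7]

Shift, then merge

Information loss!

The double-array structure is modified to include all information about the original trie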
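The shift-and-merge step can be sketched in Python as generic first-fit row displacement. This is an illustration under assumptions, not the construction used in DALM: the input rows, the first-fit policy, and the names `merge_rows` and `lookup` are hypothetical. The `owner` array plays the role that CHECK plays in a double-array; without it, the merged array cannot tell which row a cell came from.

```python
def merge_rows(rows):
    """First-fit row displacement.

    rows maps node -> {column: child}. Returns (shift, nxt, owner), where
    nxt is the merged array (as a dict position -> child) and owner records
    which node each occupied position belongs to.
    """
    shift, nxt, owner = {}, {}, {}
    for node in sorted(rows):
        cols = rows[node]
        s = 0
        while any((s + c) in nxt for c in cols):   # slide right until no collision
            s += 1
        shift[node] = s
        for c, child in cols.items():
            nxt[s + c] = child
            owner[s + c] = node
    return shift, nxt, owner

def lookup(shift, nxt, owner, node, col):
    """Return the child of node on column col, or None if there is no such edge."""
    pos = shift.get(node, 0) + col
    if owner.get(pos) != node:     # the cell belongs to another row (or is empty)
        return None
    return nxt[pos]

# Columns: A = 0, B = 1, C = 2 (illustrative).
rows = {1: {1: 2, 2: 3}, 2: {0: 4, 2: 5}, 3: {2: 6}}
shift, nxt, owner = merge_rows(rows)
print([nxt[i] for i in sorted(nxt)])
```

Node 4 has no outgoing edges, yet reading `nxt[0 + 1]` for "node 4, column B" would return a child that belongs to node 1. The ownership check is what rejects this false match.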

Page 7: An Efficient Language Model Using Double-Array Structures

Details of Double-Array Structures

(Aoe, 1989)

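[Figure: a trie with edges labeled A, B, and C and its double-array over indices 0 to 7. BASE values as extracted: 0 3 3 0 0 4 0. CHECK values as extracted: 0 0 2 3 2 6.]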

Definition: a transition from node s to node t on a symbol with code c satisfies t = BASE[s] + c and CHECK[t] = s
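A minimal Python sketch of a double-array in the sense of Aoe (1989), where a transition from node s on symbol code c goes to t = BASE[s] + c and is valid only if CHECK[t] = s. The naive first-fit builder, the symbol codes, the use of -1 for free slots, and the function names are assumptions for illustration; the arrays it produces are not the ones shown on the slide.

```python
def build_double_array(keys, code):
    """Naive builder. keys are symbol sequences; code maps symbol -> positive int.

    Index 0 is ROOT. CHECK[t] holds the parent index, or -1 for a free slot.
    """
    trie = {}
    for key in keys:                       # 1) ordinary nested-dict trie
        node = trie
        for sym in key:
            node = node.setdefault(sym, {})

    base, check = [0], [0]                 # slot 0 is occupied by ROOT

    def ensure(n):                         # grow both arrays up to index n
        while len(base) <= n:
            base.append(0)
            check.append(-1)

    stack = [(trie, 0)]
    while stack:                           # 2) place each node's children
        children, s = stack.pop()
        if not children:
            continue
        codes = [code[sym] for sym in children]
        b = 1
        while True:                        # first offset where all children fit
            ensure(b + max(codes))
            if all(check[b + c] == -1 for c in codes):
                break
            b += 1
        base[s] = b
        for sym, sub in children.items():
            t = b + code[sym]
            check[t] = s                   # reserve the slot for this child
            stack.append((sub, t))
    return base, check

def find(base, check, code, key):
    """Return the node index reached by key, or None if a transition is missing."""
    s = 0
    for sym in key:
        c = code.get(sym)
        if c is None:
            return None
        t = base[s] + c
        if t >= len(check) or check[t] != s:
            return None
        s = t
    return s

code = {"A": 1, "B": 2, "C": 3}
keys = ["BA", "BC", "CC"]
base, check = build_double_array(keys, code)
print(find(base, check, code, "BC"), find(base, check, code, "CA"))
```

Every step of `find` is one addition and one comparison, which is where the query speed comes from. Production builders use smarter free-slot search than this linear scan.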

Page 8: An Efficient Language Model Using Double-Array Structures

Efficient Trie Representations for

Ngram Model

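[Figure: a backward suffix tree under ROOT; history words (A, B, C) label the edges, and each context node carries a list of target words (X, Y, Z)]

Backward suffix trees (Bell et al., 1990; Stockle, 2002; Germann et al., 2009)

History words are stored in reverse order

Target words are stored in separate lists

Efficient back-off (in the figure, the B node is not found)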
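A minimal Python sketch of a back-off query on a backward suffix tree. The `Node` layout, the function names, and all probabilities are made up for illustration; it follows the usual back-off rule (use the longest context that contains the target, adding the back-off weights of the longer contexts that were found but lack it), with log10 values.

```python
class Node:
    def __init__(self):
        self.children = {}   # previous history word -> Node
        self.targets = {}    # target word -> log10 probability
        self.bow = 0.0       # log10 back-off weight of this context

def context_node(root, history):
    """Return (creating if needed) the node for a history, stored newest word first."""
    node = root
    for word in reversed(history):
        node = node.children.setdefault(word, Node())
    return node

def add_ngram(root, history, target, logprob):
    context_node(root, history).targets[target] = logprob

def score(root, history, target):
    """log10 P(target | history) with back-off, or None for an unknown target."""
    path = [root]
    node = root
    for word in reversed(history):        # walk from the most recent word backward
        node = node.children.get(word)
        if node is None:                  # longer context not in the model: stop
            break
        path.append(node)
    total = 0.0
    for node in reversed(path):           # longest matched context first
        if target in node.targets:
            return total + node.targets[target]
        total += node.bow                 # back off to the next shorter context
    return None

root = Node()
add_ngram(root, [], "X", -1.0)
add_ngram(root, [], "Y", -1.5)
add_ngram(root, ["A"], "X", -0.5)
add_ngram(root, ["B", "A"], "X", -0.25)
context_node(root, ["A"]).bow = -0.25
context_node(root, ["B", "A"]).bow = -0.125

print(score(root, ["B", "A"], "Y"))   # backs off twice
print(score(root, ["C", "A"], "X"))   # the C node is not found under A
```

Because the history is stored in reverse, a single walk from ROOT visits every context the query may back off to, so no second lookup is needed.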

Page 9: An Efficient Language Model Using Double-Array Structures

Endmarker Symbols for

Backward Suffix Trees

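[Figure: the backward suffix tree before and after adding endmarker symbols (#); in the second tree, the target words (X, Y, Z) hang below a # child of each history node]

Endmarker symbols (Aoe, 1989) are placed after history words

The target word follows the endmarker symbol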

Page 10: An Efficient Language Model Using Double-Array Structures

Double-array Representation of

Backward Suffix Trees

Endmarker symbols are treated as words

A word ID is assigned to the endmarker symbol

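[Figure: a small backward suffix tree under ROOT (symbols B, X, Y, Z shown) and its double-array over indices 0 to 7. BASE values as extracted: 0 2 4 0 0 4 0. CHECK values as extracted: 0 2 2 3 3 3.]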
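As a sketch of how an n-gram could become a key for such a trie: the history is reversed, the endmarker follows it, and the target word comes last. The word IDs and the function name `ngram_to_key` below are illustrative assumptions, not values taken from DALM.

```python
# Illustrative word IDs; the endmarker "#" is treated as an ordinary word.
WORD_ID = {"#": 1, "B": 2, "X": 3, "A": 4, "Y": 5, "C": 6, "Z": 7, "D": 8}

def ngram_to_key(history, target, word_id=WORD_ID, endmarker="#"):
    """Encode an n-gram as: reversed history IDs, endmarker ID, target ID."""
    key = [word_id[w] for w in reversed(history)]
    key.append(word_id[endmarker])
    key.append(word_id[target])
    return key

print(ngram_to_key(["A", "B"], "X"))   # the n-gram "A B X"
print(ngram_to_key([], "X"))           # a unigram has an empty history
```

With the endmarker in place, histories and targets share one symbol space, so a single double-array can hold both.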

Page 11: An Efficient Language Model Using Double-Array Structures

Double-array Language Model:

Simple Structures

Introducing a VALUE array

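[Figure: a double-array over indices 0 to 7 for a trie with symbols A, B, #, and X under ROOT, shown with BASE, CHECK, and a separate VALUE array. BASE values as extracted: 0 2 5 4. CHECK values as extracted: 0 3 6.]

The VALUE array contains the corresponding probabilities and back-off weights (BOW)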

Page 12: An Efficient Language Model Using Double-Array Structures

Double-array Language Model:

Embedding structures (1)

Filling unused slots with values

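[Figure: the double-array over indices 0 to 7 (symbols A, B, #, and X under ROOT; BASE values as extracted: 0 2 5 4; CHECK values as extracted: 0 3 6), with its unused slots highlighted]

These empty slots are used to store values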

Page 13: An Efficient Language Model Using Double-Array Structures

Double-array Language Model:

Embedding structures (2)

Using the BASE and CHECK arrays to store values

[Figure: the double-array over indices 0 to 7 (symbols A, B, #, X) with a separate VALUE array. Rows as extracted: 0 2 5 4 and 0 3 -2 6.]

Lossless quantization

Index of the VALUE array with a negative sign
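A rough Python sketch of the two ideas named on this slide: storing each distinct value once (lossless quantization) and marking a slot as "index into VALUE" with a negative sign. Which array slot holds the negative index is not recoverable from the transcript, so the sketch only shows the encoding. Leaving VALUE[0] unused (so that no index negates to 0) and the names `quantize` and `read_slot` are assumptions.

```python
def quantize(values):
    """Store each distinct value once; return (value_table, slot_contents).

    Slot contents are negated indices into value_table. Index 0 is left
    unused (an assumption) so that every negated index is strictly negative.
    """
    table = [None]
    index_of = {}
    slots = []
    for v in values:
        if v not in index_of:
            index_of[v] = len(table)
            table.append(v)
        slots.append(-index_of[v])     # the negative sign marks a VALUE index
    return table, slots

def read_slot(slot, table):
    """Interpret a slot: negative means a VALUE index, otherwise a plain offset."""
    if slot < 0:
        return ("value", table[-slot])
    return ("offset", slot)

table, slots = quantize([-0.5, -1.25, -0.5])
print(table, slots)
print(read_slot(slots[1], table), read_slot(5, table))
```

Nothing is rounded, so the scheme is lossless; the saving comes from repeated values sharing one VALUE entry.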

Page 14: An Efficient Language Model Using Double-Array Structures

Double-array Language Model:

Ordering method (1)

Tuning for word IDs

We assign word IDs in order of unigram probability

P(Word)   Word   Word ID
-         #      1
0.0413    B      2
0.0300    X      3
0.0284    A      4
0.0201    Y      5
0.0101    C      6
0.0050    Z      7
0.0020    D      8

Sort the words in order of descending probability
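The ID assignment in the table can be reproduced with a short Python sketch. The function name `assign_word_ids` is hypothetical, and reserving ID 1 for the endmarker follows the table above.

```python
def assign_word_ids(unigram, endmarker="#"):
    """Give ID 1 to the endmarker, then IDs 2, 3, ... by descending unigram probability."""
    ids = {endmarker: 1}
    ranked = sorted(unigram, key=unigram.get, reverse=True)
    for word_id, word in enumerate(ranked, start=2):
        ids[word] = word_id
    return ids

# Unigram probabilities from the table.
unigram = {"A": 0.0284, "B": 0.0413, "C": 0.0101, "D": 0.0020,
           "X": 0.0300, "Y": 0.0201, "Z": 0.0050}
ids = assign_word_ids(unigram)
print(ids)
```

Frequent words receive small IDs, so the occupied columns of each row presumably cluster toward the low end, which should let rows pack more tightly in the double-array.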

Page 15: An Efficient Language Model Using Double-Array Structures

Double-array Language Model:

Ordering method (2)

Modifying the 2D array

[Figure: the 2D transition array (rows Node# 1 to 4, one column per word) before ordering and after ordering; one version has its columns in word-ID order: #, B, X, A, Y, C, Z, D]

Page 16: An Efficient Language Model Using Double-Array Structures

Experiments: Datasets

Model        Corpus size [words]   Unique types [words]   N-grams (unigrams to 5-grams)
100 Mwords   100 M                 195 K                  31 M
5 Gwords     5 G                   2,140 K                936 M
Test set     100 M                 198 K                  -

Data source: publication of unexamined Japanese patent applications

Distributed with the NTCIR 3, 4, 5, 6 patent retrieval tasks

(Iwayama et al., 2003; Fujii et al., 2004; 2005; 2007)

Page 17: An Efficient Language Model Using Double-Array Structures

Comparison: Proposed Methods

Results for 100-Mword corpus

Page 18: An Efficient Language Model Using Double-Array Structures

Division Method

(Nakamura and Mochizuki, 2006)

Building a large double-array structure takes a long time

It is impractical to wait for the 5-Gword model to be built

Dividing the trie into several parts

[Figure: a trie under a single ROOT (nodes A, C, and endmarkers #) divided into parts, each with its own ROOT]
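A rough Python sketch of the division idea: split the keys into several groups, build one structure per group, and route each query to the right part. The partition rule (first symbol ID modulo the number of parts), the nested-dict tries standing in for double-arrays, and the function names are all assumptions; the transcript does not say how the trie is actually divided.

```python
def build_trie(keys):
    """Nested-dict trie; the None entry marks the end of a stored key."""
    root = {}
    for key in keys:
        node = root
        for sym in key:
            node = node.setdefault(sym, {})
        node[None] = True
    return root

def divide_and_build(keys, num_parts):
    """Group non-empty keys by first symbol ID (hypothetical rule) and build each part."""
    groups = [[] for _ in range(num_parts)]
    for key in keys:
        groups[key[0] % num_parts].append(key)
    return [build_trie(group) for group in groups]   # each part has its own ROOT

def contains(parts, key):
    """Route the query to its part, then walk that part's trie."""
    node = parts[key[0] % len(parts)]
    for sym in key:
        node = node.get(sym)
        if node is None:
            return False
    return None in node

keys = [(2, 1, 3), (4, 2, 1, 3), (6, 1, 5)]
parts = divide_and_build(keys, 3)
print(contains(parts, (4, 2, 1, 3)), contains(parts, (5, 1, 3)))
```

Each part is smaller than the whole, so each build searches a shorter array for free slots, and the parts can be built independently.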

Page 19: An Efficient Language Model Using Double-Array Structures

Experiments: Division Methods

Results for 100-Mword corpus

Page 20: An Efficient Language Model Using Double-Array Structures

Experiments: Other Methods

Results for 100-Mword and 5-Gword corpora

Page 21: An Efficient Language Model Using Double-Array Structures

Discussion

DALM is smaller and faster than KenLM Probing

The smallest LM is KenLM Trie

The differences between KenLM Probing and DALM are smaller for the 5-Gword model than for the 100-Mword model

Larger language models spend less time on back-off

Page 22: An Efficient Language Model Using Double-Array Structures

Conclusion

We proposed an efficient language model using double-array structures

We proposed two optimization methods: embedding and ordering

In experiments, DALM achieved the best speed among the compared LMs while keeping a modest model size.

• Double-array structures are a fast and compact representation of tries

• We use double-array structures to represent backward suffix trees

• Embedding: using empty slots in the double-array to store values

• Ordering: tuning word IDs to make LMs smaller and faster

Page 23: An Efficient Language Model Using Double-Array Structures

Questions…

My English skills are limited

Please speak slowly if you have any questions.