Post on 17-Jan-2018

Text Boundary Analysis

Eric Mader

Advisory Software Engineer

IBM

Where do I break lines?

The rain in Spain stays mainly on the plain.

ด่ๅแรงฃนึ๓อัตราลกูจา้งใหมใ่ห๓้๕

您有坦率和誠實的聲譽。

Even in English, this can be hard

You owe me $1,234.56... I think.

Word wrapping vs word selection

Some characters’ behavior is context-dependent.

Word wrapping:

Searching by words:

Analysis by pairs

[Pair table: one axis is labeled "first" and the other "second". Character classes shown across the builds: ltr, dgt, sp, pun, -, nbs, kji. Cells are marked with X; the cell positions did not survive extraction.]
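A minimal sketch of pair analysis, for illustration only. The table contents below are an assumption (the slide's own cell values are not recoverable), and the names `char_class`, `KEEP_TOGETHER`, and `split_by_pairs` are hypothetical. Each decision looks at exactly two characters: the one before the candidate position and the one after it.

```python
def char_class(ch):
    """Map a character to one of the four basic classes named on the slide."""
    if ch.isalpha():
        return "ltr"
    if ch.isdigit():
        return "dgt"
    if ch.isspace():
        return "sp"
    return "pun"

# Assumed table contents: pairs listed here are kept together; every other
# (first, second) pair gets a break between the two characters.
KEEP_TOGETHER = {("ltr", "ltr"), ("dgt", "dgt"), ("ltr", "dgt"), ("dgt", "ltr")}

def split_by_pairs(text):
    """Split text wherever the pair around a position is not in the table."""
    pieces, start = [], 0
    for i in range(1, len(text)):
        pair = (char_class(text[i - 1]), char_class(text[i]))
        if pair not in KEEP_TOGETHER:
            pieces.append(text[start:i])
            start = i
    if text:
        pieces.append(text[start:])
    return pieces

print(split_by_pairs("rain 42!"))
print(split_by_pairs("4.5"))
```

Because the table sees only two characters at a time, this sketch splits "4.5" into three pieces: the pair (dgt, pun) gets the same answer whether or not another digit follows the punctuation.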

Where pairs break down

A break position can depend on more than two characters:

You owe me $1,234.56... I think.

4.5

6..

Sentence boundaries require even more lookahead:

He asked, “How tall are you?” I’m about 6 ft. tall. “Wow!”

An example

•If not otherwise mentioned, each character is a “word” unto itself.

•A run of letters constitutes a “word” and is kept together. Certain punctuation marks may appear inside a word, but only if they have a letter on each side.

•A run of digits constitutes a “number” and is kept together. Certain punctuation marks may appear inside a number, but only if they have a digit on each side. In addition, a number may have certain optional prefix and suffix characters.

•If a “word” and a “number” appear in succession with nothing between them, they’re kept together.

The state-machine approach

[State diagram: labels start, A, ’ ., 0, $, %. Example input: $1,234.56...]
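A small hand-built state table can show how a state machine handles the number case. This is a sketch under stated assumptions, not the table from the slides: the state names, the character classes, and the simplification to numbers only (prefix `$`, separators `.` and `,`, suffix `%`) are all mine. The scanner remembers the last accepting position and falls back to it when no transition exists.

```python
def char_class(ch):
    """Hypothetical classes for a numbers-only machine."""
    if ch.isdigit():
        return "dgt"
    if ch in ".,":
        return "mid"
    if ch == "$":
        return "pre"
    if ch == "%":
        return "post"
    return "other"

# state -> {character class -> next state}; a missing entry means "stop".
TRANSITIONS = {
    "start":  {"pre": "prefix", "dgt": "digits"},
    "prefix": {"dgt": "digits"},
    "digits": {"dgt": "digits", "mid": "mid", "post": "suffix"},
    "mid":    {"dgt": "digits"},
    "suffix": {},
}
ACCEPTING = {"digits", "suffix"}

def next_break(text, pos):
    """Return the break position that follows pos."""
    state = "start"
    last_accept = pos + 1  # default rule: a character is a unit unto itself
    i = pos
    while i < len(text):
        state = TRANSITIONS[state].get(char_class(text[i]))
        if state is None:
            break
        i += 1
        if state in ACCEPTING:
            last_accept = i
    return last_accept

def split(text):
    pieces, pos = [], 0
    while pos < len(text):
        end = next_break(text, pos)
        pieces.append(text[pos:end])
        pos = end
    return pieces

print(split("$1,234.56..."))
```

On the slide's input the machine enters the non-accepting `mid` state after "56.", finds no digit, and backs up to the last accepting position, so the period after "56" is not absorbed into the number.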

Limitations

1992–1996

–1996

Automatic table building

let=[:L:];
dgt=[:N:];
mid-word=[[:Pd:]\”\’\.];
mid-num=[\”\’\.\,];
pre-num=[[[:Sc:]\#\.]-[¢]];
post-num=[\%\&¢];
word=({let}+({mid-word}{let}+)*);
number=({dgt}+({mid-num}{dgt}+)*);
{word}?({number}{word})*({number}{post-num}?)?;
{pre-num}({number}{word})*({number}{post-num}?)?;

•All regular-expression rules have equal precedence

•The “winning” rule is decided using a longest-possible-match algorithm (except in certain well-defined cases)

•Our build algorithm parses the regular expressions, builds the state table, and makes sure it’s deterministic in a single pass
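The equal-precedence, longest-match idea can be approximated with ordinary regular expressions. The sketch below is an assumption-laden translation of the word and number rules into Python's `re` syntax: the character classes are simplified stand-ins for the slide's sets, and `re` is a backtracking engine rather than a compiled state table, so taking the longest of the per-rule matches only imitates the behavior described above.

```python
import re

# Simplified stand-ins for the slide's character sets (assumptions).
LET = r"[^\W\d_]"          # any letter
DGT = r"\d"
MID_WORD = r"[-’'.]"
MID_NUM = r"[’'.,]"
PRE_NUM = r"[$#.]"
POST_NUM = r"[%&¢]"

WORD = rf"{LET}+(?:{MID_WORD}{LET}+)*"
NUMBER = rf"{DGT}+(?:{MID_NUM}{DGT}+)*"

RULES = [re.compile(p, re.S) for p in (
    rf"(?:{WORD})?(?:{NUMBER}{WORD})*(?:{NUMBER}{POST_NUM}?)?",
    rf"{PRE_NUM}(?:{NUMBER}{WORD})*(?:{NUMBER}{POST_NUM}?)?",
    r".",                  # default: each character is a unit unto itself
)]

def next_break(text, pos):
    """No rule outranks another: the longest match at pos wins."""
    ends = [m.end() for m in (rule.match(text, pos) for rule in RULES) if m]
    return max(ends)

def split(text):
    pieces, pos = [], 0
    while pos < len(text):
        end = next_break(text, pos)
        pieces.append(text[pos:end])
        pos = end
    return pieces

print(split("You owe me $1,234.56... I think."))
```

The default single-character rule guarantees progress even when the first rule matches the empty string.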

Sentence-break rules

.*?{term}[{term}{period}{end}]*{space}*;
.*?{period}[{period}{end}]*{space}*/{start}*{sent-start};
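As an illustration, the two rules can be transcribed into Python regular expressions. Several things here are assumptions: the contents of each character class, the reading of `/` as "break here, with what follows as lookahead only", and the tie-breaking choice of taking whichever rule ends first (both rules start with a non-greedy `.*?`).

```python
import re

# Assumed character classes; the slides do not define them.
TERM = r"[?!]"
PERIOD = r"\."
END = r"[”’)\]]"           # closing punctuation
SPACE = r"\s"
START = r"[“‘(\[]"         # opening punctuation
SENT_START = r"[A-Z]"      # a character that can begin a sentence

# Rule 1: a terminator always ends the sentence.
RULE_TERM = re.compile(
    rf".*?{TERM}(?:{TERM}|{PERIOD}|{END})*{SPACE}*", re.S)
# Rule 2: a period ends the sentence only if a sentence start follows.
RULE_PERIOD = re.compile(
    rf".*?{PERIOD}(?:{PERIOD}|{END})*{SPACE}*(?={START}*{SENT_START})", re.S)

def sentences(text):
    out, pos = [], 0
    while pos < len(text):
        matches = (RULE_TERM.match(text, pos), RULE_PERIOD.match(text, pos))
        ends = [m.end() for m in matches if m]
        end = min(ends) if ends else len(text)
        out.append(text[pos:end])
        pos = end
    return out

for s in sentences("He asked, “How tall are you?” I’m about 6 ft. tall. “Wow!”"):
    print(repr(s))
```

With these classes, the period in "ft." is followed by a lowercase letter, so the lookahead fails there and the sentence continues to "tall."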

Ignore characters

$ignore=[[:Mn:][:Me:][:Cf:]];
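One way to picture the effect of an ignore set: characters in the general categories Mn, Me, and Cf are skipped when deciding where breaks go, so they stay attached to whatever precedes them. The sketch below assumes a deliberately tiny rule (break everywhere except between two letters); `split_runs` is a hypothetical helper, and only the three category names come from the slide.

```python
import unicodedata

IGNORE = {"Mn", "Me", "Cf"}   # the categories listed in $ignore

def split_runs(text):
    """Break between every pair of significant characters unless both are letters."""
    pieces, start, prev_is_letter = [], 0, None
    for i, ch in enumerate(text):
        if unicodedata.category(ch) in IGNORE:
            continue          # transparent: never causes or prevents a break
        is_letter = ch.isalpha()
        if prev_is_letter is not None and not (prev_is_letter and is_letter):
            pieces.append(text[start:i])
            start = i
        prev_is_letter = is_letter
    if start < len(text):
        pieces.append(text[start:])
    return pieces

# "cafe" + U+0301 COMBINING ACUTE ACCENT, then " au"
print(split_runs("cafe\u0301 au"))
```

Without the ignore check, the combining accent would count as a non-letter and split "café" before the accent.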

Surrogate support

kanji=[\u4e00-\u9fff\udb80-\udb83];
$ignore=[[:Mn:][:Me:][:Cf:]\udc00-\udcff];
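The rule above appears to classify a supplementary character by its high surrogate and to ignore the low surrogate, so the pair is never split. A rough sketch of that idea over UTF-16 code units follows. Note one deliberate difference, which is my assumption rather than the slide's: the sketch treats the whole low-surrogate block, U+DC00 to U+DFFF, as ignorable, whereas the slide's rule lists a narrower range.

```python
import struct

def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-be")
    return list(struct.unpack(f">{len(data) // 2}H", data))

def is_low_surrogate(unit):
    return 0xDC00 <= unit <= 0xDFFF

def allowed_positions(units):
    """Positions where a break may fall: never directly before a low surrogate."""
    return [i for i in range(len(units) + 1)
            if i == len(units) or not is_low_surrogate(units[i])]

units = utf16_units("a\U00020000b")   # U+20000 needs a surrogate pair
print([hex(u) for u in units])
print(allowed_positions(units))
```

Position 2 is missing from the result because it falls between the high and low surrogate of U+20000.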

Random-access iteration

You owe me $1,234.56... I think.

!{sent-start}{start}*{space}*{end}*{period};
![{sent-start}{lc}{digit}]{start}*{space}*{end}*{term};
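Reading the `!` rules as backward rules that locate a safe place to restart, random access can be sketched as "back up to a position known to be a boundary, then iterate forward". Everything below is a simplified assumption: the forward rules are a small word and number tokenizer, and the safe position is taken to be the spot just after the nearest preceding whitespace, which is always a boundary under these particular rules.

```python
import re

# Hypothetical forward rules: a number with optional prefix and suffix,
# a word, or any single character.
TOKEN = re.compile(
    r"\$?\d+(?:[.,]\d+)*%?|[A-Za-z]+(?:['’.-][A-Za-z]+)*|.", re.S)

def safe_start(text, offset):
    """Back up to just after the nearest preceding whitespace (or to 0)."""
    pos = min(offset, len(text))
    while pos > 0 and not text[pos - 1].isspace():
        pos -= 1
    return pos

def following(text, offset):
    """First boundary strictly after offset, or None if there is none."""
    pos = safe_start(text, offset)
    while pos < len(text):
        pos = TOKEN.match(text, pos).end()
        if pos > offset:
            return pos
    return None

text = "You owe me $1,234.56... I think."
print(following(text, 14))   # offset 14 is inside "$1,234.56"
```

Starting the forward scan at offset 14 itself would begin in the middle of the number; backing up to the `$` first lets the forward rules see the whole token.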

Dictionary-based iteration

We hold these truths to be self-evident: that all men are created equal, that they are endowed by their Creator with certain unalienable rights, that among these are Life, Liberty, and the Pursuit of Happiness.

Weholdthesetruthstobeself-evident:thatallmenarecreatedequal,thattheyareendowedbytheirCreatorwithcertainunalienablerights,thatamongtheseareLife,Liberty,andthePursuitofHappiness.

$dictionary=[A-Za-z\-\’];

themendinetonight
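The slides do not spell out the search procedure, so the following is only one plausible sketch: try the longest dictionary word first and backtrack when the remainder cannot be divided into words. The word list is invented for the example, and `segment` is a hypothetical name.

```python
from functools import lru_cache

# Invented dictionary, chosen so that several false starts are possible.
WORDS = frozenset({
    "the", "them", "theme", "me", "men", "mend", "end",
    "in", "din", "dine", "to", "on", "night", "tonight",
})
MAX_LEN = max(len(w) for w in WORDS)

def segment(text):
    """Split text into dictionary words, or return None if impossible."""
    @lru_cache(maxsize=None)
    def solve(pos):
        if pos == len(text):
            return ()
        # Longest candidate first; shorter ones are tried on backtracking.
        for length in range(min(MAX_LEN, len(text) - pos), 0, -1):
            word = text[pos:pos + length]
            if word in WORDS:
                rest = solve(pos + length)
                if rest is not None:
                    return (word,) + rest
        return None

    result = solve(0)
    return list(result) if result is not None else None

print(segment("themendinetonight"))
```

With this word list, "theme" and "them" are both tried and abandoned before "the" leads to a complete division of the text.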
