Text Boundary Analysis Eric Mader Advisory Software Engineer IBM



Where do I break lines?

The rain in Spain stays mainly on the plain.

ด่ๅแรงฃนึ๓อัตราลกูจา้งใหมใ่ห๓้๕

您有坦率和誠實的聲譽。


Even in English, this can be hard

You owe me $1,234.56... I think.

Word wrapping vs word selection

Some characters’ behavior is context-dependent.

Word wrapping:

Searching by words:

Analysis by pairs

[Table: pair-analysis grid indexed by the class of the first character and the class of the second, with X marks in selected cells. The classes start as ltr, dgt, sp, and pun; later builds add -, nbs, and kji.]
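The pair-based idea can be sketched in Python as follows. The class names follow the table's labels (ltr, dgt, sp, pun), but the `classify` helper and the particular set of no-break pairs are illustrative assumptions, not the contents of the slide's table.

```python
# Pair-based break analysis: the decision at each position depends only
# on the classes of the two characters on either side of it.
# NOTE: NO_BREAK is an illustrative assumption, not the table from the slides.

def classify(ch: str) -> str:
    """Map a character to one of the four classes used in the table."""
    if ch.isalpha():
        return "ltr"
    if ch.isdigit():
        return "dgt"
    if ch.isspace():
        return "sp"
    return "pun"

# (first, second) pairs between which a break is NOT allowed.
NO_BREAK = {
    ("ltr", "ltr"), ("ltr", "dgt"), ("dgt", "ltr"), ("dgt", "dgt"),
    ("ltr", "pun"), ("dgt", "pun"), ("pun", "pun"),
    ("ltr", "sp"), ("dgt", "sp"), ("pun", "sp"), ("sp", "sp"),
}

def break_positions(text: str) -> list[int]:
    """Return offsets i where a break is allowed between text[i-1] and text[i]."""
    classes = [classify(c) for c in text]
    return [
        i for i in range(1, len(text))
        if (classes[i - 1], classes[i]) not in NO_BREAK
    ]

def segments(text: str) -> list[str]:
    """Split text at every allowed break position."""
    cuts = [0, *break_positions(text), len(text)]
    return [text[a:b] for a, b in zip(cuts, cuts[1:])]
```

With this table, `segments("The rain in Spain")` breaks only after each space, while `segments("$1,234.56")` cuts the amount into pieces, because a two-character window cannot tell number punctuation from sentence punctuation.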

Where pairs break down

A break position can depend on more than two characters:

You owe me $1,234.56... I think.

Highlighted on successive builds: “4.5” and “6..”
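A small Python sketch of the three-character test: whether a period or comma belongs to a number depends on both the character before it and the character after it. The helper name `inside_number` and the `MID_NUM` set are assumptions made for illustration.

```python
# Characters that may sit inside a number (assumed subset for this sketch).
MID_NUM = {".", ","}

def inside_number(text: str, i: int) -> bool:
    """True if text[i] is number punctuation with a digit on each side."""
    return (
        text[i] in MID_NUM
        and 0 < i < len(text) - 1
        and text[i - 1].isdigit()
        and text[i + 1].isdigit()
    )

sample = "You owe me $1,234.56... I think."
comma = sample.index(",")       # the comma in "1,234"
point = sample.index(".")       # the period in "4.5"
ellipsis = sample.index("...")  # the first period in "6.."
```

Here `inside_number(sample, point)` is true while `inside_number(sample, ellipsis)` is false, even though both periods directly follow a digit.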

Where pairs break down

Sentence boundaries require even more lookahead:

He asked, “How tall are you?” I’m about 6 ft. tall. “Wow!”

An example

•If not otherwise mentioned, each character is a “word” unto itself.

•A run of letters constitutes a “word” and is kept together. Certain punctuation marks may appear inside a word, but only if they have a letter on each side.

•A run of digits constitutes a “number” and is kept together. Certain punctuation marks may appear inside a number, but only if they have a digit on each side. In addition, a number may have certain optional prefix and suffix characters.

•If a “word” and a “number” appear in succession with nothing between them, they’re kept together.

The state-machine approach

[Figure: state diagram labeled start, A, ’ ., 0, $, and %. Later builds trace the input $1,234.56... through the machine.]
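A table-driven scanner in this spirit might look like the Python sketch below. The character classes, state names, and transition table are assumptions modeled on the word and number rules from the talk; they are not a transcription of the diagram. For brevity a single "mid" class covers punctuation that may appear inside either a word or a number.

```python
# Table-driven word-boundary scanner (illustrative sketch, not the slide's machine).

def char_class(ch: str) -> str:
    if ch.isalpha():
        return "let"
    if ch.isdigit():
        return "dgt"
    if ch in ".,'’":
        return "mid"   # may appear inside a word or number
    if ch == "$":
        return "pre"   # optional number prefix
    if ch == "%":
        return "post"  # optional number suffix
    return "other"

TRANSITIONS = {
    ("start", "let"): "word",
    ("start", "dgt"): "num",
    ("start", "pre"): "prefix",
    ("word", "let"): "word",
    ("word", "dgt"): "num",
    ("word", "mid"): "word_mid",
    ("word_mid", "let"): "word",
    ("num", "dgt"): "num",
    ("num", "let"): "word",
    ("num", "mid"): "num_mid",
    ("num", "post"): "suffix",
    ("num_mid", "dgt"): "num",
    ("prefix", "dgt"): "num",
}
ACCEPTING = {"word", "num", "prefix", "suffix"}

def next_boundary(text: str, start: int) -> int:
    """Run the machine from start; return the end of the last accepted prefix."""
    state = "start"
    last_accept = start + 1  # fallback: a single character is its own unit
    pos = start
    while pos < len(text):
        state = TRANSITIONS.get((state, char_class(text[pos])))
        if state is None:    # no transition: the machine stops
            break
        pos += 1
        if state in ACCEPTING:
            last_accept = pos
    return last_accept

def split_words(text: str) -> list[str]:
    """Cut text into units by repeatedly running the machine."""
    out, pos = [], 0
    while pos < len(text):
        end = next_boundary(text, pos)
        out.append(text[pos:end])
        pos = end
    return out
```

Remembering the last accepting position is what lets the machine read past "56." into the second period, discover that no digit follows, and fall back to a boundary after "56".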


Limitations

1992–1996

Automatic table building

•If not otherwise mentioned, each character is a “word” unto itself.

•A run of letters constitutes a “word” and is kept together. Certain punctuation marks may appear inside a word, but only if they have a letter on each side.

•A run of digits constitutes a “number” and is kept together. Certain punctuation marks may appear inside a number, but only if they have a digit on each side. In addition, a number may have certain optional prefix and suffix characters.

•If a “word” and a “number” appear in succession with nothing between them, they’re kept together.

Automatic table building

let=[:L:];
dgt=[:N:];
mid-word=[[:Pd:]\”\’\.];
mid-num=[\”\’\.\,];
pre-num=[[[:Sc:]\#\.]-[¢]];
post-num=[\%\&¢];
word=({let}+({mid-word}{let}+)*);
number=({dgt}+({mid-num}{dgt}+)*);
{word}?({number}{word})*({number}{post-num}?)?;
{pre-num}({number}{word})*({number}{post-num}?)?;

•All regular-expression rules have equal precedence

•The “winning” rule is decided using a longest-possible-match algorithm (except in certain well-defined cases)

•Our build algorithm parses the regular expressions, builds the state table, and makes sure it’s deterministic in a single pass
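The equal-precedence, longest-match behavior can be approximated with Python's `re` module, as sketched below. This is not the table builder described above: `re` is a backtracking engine rather than a generated state table, and the character classes are rough stand-ins (for example, a handful of currency signs instead of the full Unicode Sc category). The names `RULES`, `next_boundary`, and `split_words` are hypothetical.

```python
import re

# Approximate character classes (assumptions; Python's re has no [:L:] syntax).
LET = r"[^\W\d_]"        # letters
DGT = r"\d"              # digits
MID_WORD = r"[-'’.]"     # allowed between letters
MID_NUM = r"['’.,]"      # allowed between digits
PRE_NUM = r"[$£¥€#.]"    # optional number prefix
POST_NUM = r"[%&¢]"      # optional number suffix

WORD = rf"(?:{LET}+(?:{MID_WORD}{LET}+)*)"
NUMBER = rf"(?:{DGT}+(?:{MID_NUM}{DGT}+)*)"

RULES = [
    rf"{WORD}?(?:{NUMBER}{WORD})*(?:{NUMBER}{POST_NUM}?)?",
    rf"{PRE_NUM}(?:{NUMBER}{WORD})*(?:{NUMBER}{POST_NUM}?)?",
    r".",  # fallback: any single character is a "word" unto itself
]
COMPILED = [re.compile(rule, re.DOTALL) for rule in RULES]

def next_boundary(text: str, pos: int) -> int:
    """Try every rule at pos and keep the longest match; no rule has priority."""
    return max(m.end() for rx in COMPILED if (m := rx.match(text, pos)))

def split_words(text: str) -> list[str]:
    """Cut text into units by repeated longest-match."""
    out, pos = [], 0
    while pos < len(text):
        end = next_boundary(text, pos)
        out.append(text[pos:end])
        pos = end
    return out
```

The single-character fallback rule always matches, so `next_boundary` is guaranteed to advance even where the first rule matches the empty string.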

Sentence-break rules

.*?{term}[{term}{period}{end}]*{space}*;
.*?{period}[{period}{end}]*{space}*/{start}*{sent-start};
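A rough Python rendering of these two rules, reading the slash as a lookahead: a period ends a sentence only if what follows looks like the start of a new one. The slides do not show the definitions of {term}, {end}, {start}, or {sent-start}, so the character sets below are assumptions chosen to fit the example sentence.

```python
import re

TERM = r"[?!]"         # unambiguous sentence terminators (assumed)
PERIOD = r"\."
END = r"[”’)\]]"       # closing punctuation (assumed)
SPACE = r"\s"
START = r"[“‘(\[]"     # opening punctuation (assumed)
SENT_START = r"[A-Z]"  # simplification: ASCII capitals only

SENTENCE = re.compile(
    rf".*?(?:"
    rf"{TERM}(?:{TERM}|{PERIOD}|{END})*{SPACE}*"
    rf"|{PERIOD}(?:{PERIOD}|{END})*{SPACE}*(?={START}*{SENT_START})"
    rf")",
    re.DOTALL,
)

def split_sentences(text: str) -> list[str]:
    """Split text into sentences; trailing text without a terminator is kept."""
    out, pos = [], 0
    while pos < len(text):
        m = SENTENCE.match(text, pos)
        end = m.end() if m else len(text)
        out.append(text[pos:end])
        pos = end
    return out
```

On the example sentence, the period after "ft" is rejected because a lowercase "t" follows, while the period after "tall" is accepted because an opening quote and a capital "W" follow.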


Ignore characters

$ignore=[[:Mn:][:Me:][:Cf:]];
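The three categories in the $ignore set are Unicode general categories: Mn (nonspacing marks), Me (enclosing marks), and Cf (format characters). A minimal Python sketch of filtering them out before the break rules see the text; treating them as fully transparent is an assumption about the engine's behavior, and the helper names are hypothetical.

```python
import unicodedata

IGNORE_CATEGORIES = {"Mn", "Me", "Cf"}

def is_ignored(ch: str) -> bool:
    """True for combining marks and format characters."""
    return unicodedata.category(ch) in IGNORE_CATEGORIES

def visible(text: str) -> list[tuple[int, str]]:
    """Characters the break rules would see, paired with their original offsets."""
    return [(i, ch) for i, ch in enumerate(text) if not is_ignored(ch)]
```

Keeping the original offsets lets boundaries found on the filtered text be mapped back, so a combining accent always stays with the letter before it.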

Surrogate support

kanji=[\u4e00-\u9fff\udb80-\udb83];
$ignore=[[:Mn:][:Me:][:Cf:]\udc00-\udcff];
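These rules operate on UTF-16 code units: the high surrogate carries the character class and the low surrogate is added to the ignore set. The Python sketch below shows the arithmetic and classifies code units using the ranges exactly as they appear on the slide. The function names are hypothetical, and the full low-surrogate range is U+DC00 to U+DFFF, which is wider than the slide's range.

```python
def to_surrogates(code_point: int) -> tuple[int, int]:
    """Split a supplementary code point (U+10000..U+10FFFF) into a UTF-16 pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

def utf16_units(text: str) -> list[int]:
    """Return the UTF-16 code units of text."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

def unit_class(unit: int) -> str:
    """Classify one code unit using the ranges shown on the slide."""
    if 0x4E00 <= unit <= 0x9FFF or 0xDB80 <= unit <= 0xDB83:
        return "kanji"
    if 0xDC00 <= unit <= 0xDCFF:
        return "ignore"
    return "other"
```

For example, U+F0000 encodes as the pair DB80 DC00, so the first unit is classified as kanji and the second is ignored, and no boundary can fall between them.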

Random-access iteration

You owe me $1,234.56... I think.

!{sent-start}{start}*{space}*{end}*{period};
![{sent-start}{lc}{digit}]{start}*{space}*{end}*{term};
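One common way to support random access is to back up from the requested offset to a position known to be a boundary, then iterate forward. The rules prefixed with "!" appear to serve that backing-up step, though the transcript does not say so explicitly. The Python sketch below uses a toy forward scanner and a toy safe-point test, both assumptions, to show the control flow only.

```python
def next_boundary(text: str, pos: int) -> int:
    """Toy forward scanner: a space is its own unit; anything else runs to the next space."""
    if text[pos].isspace():
        return pos + 1
    end = pos
    while end < len(text) and not text[end].isspace():
        end += 1
    return end

def is_safe(text: str, pos: int) -> bool:
    """A safe restart point is the start of the text or a position right after a space."""
    return pos == 0 or text[pos - 1].isspace()

def following(text: str, offset: int) -> int:
    """Return the first boundary strictly after offset."""
    start = min(offset, len(text))
    while not is_safe(text, start):   # back up to a known boundary
        start -= 1
    pos = start
    while pos <= offset and pos < len(text):
        pos = next_boundary(text, pos)  # replay forward past the offset
    return pos
```

Starting in the middle of "$1,234.56..." (offset 15), the search backs up to the dollar sign at offset 11 and then scans forward to the boundary at offset 23.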

Dictionary-based iteration

We hold these truths to be self-evident: that all men are created equal, that they are endowed by their Creator with certain unalienable rights, that among these are Life, Liberty, and the Pursuit of Happiness.

Weholdthesetruthstobeself-evident:thatallmenarecreatedequal,thattheyareendowedbytheirCreatorwithcertainunalienablerights,thatamongtheseareLife,Liberty,andthePursuitofHappiness.

$dictionary=[A-Za-z\-\’];

themendinetonight
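The transcript does not preserve the steps of the dictionary walk, so the following Python sketch is a generic backtracking segmenter rather than the algorithm from the talk. The word list is a small made-up dictionary chosen so that the greedy first guesses ("theme", "them") lead to dead ends.

```python
from typing import Optional

# Hypothetical dictionary for illustration only.
WORDS = {
    "the", "them", "theme", "men", "mend", "end",
    "din", "dine", "in", "on", "to", "night", "tonight",
}

def segment(text: str, dictionary: set[str], start: int = 0) -> Optional[list[str]]:
    """Split text[start:] into dictionary words, or return None if impossible."""
    if start == len(text):
        return []
    for end in range(len(text), start, -1):  # try the longest candidate first
        word = text[start:end]
        if word in dictionary:
            rest = segment(text, dictionary, end)
            if rest is not None:             # the remainder also segments
                return [word, *rest]
    return None                              # dead end: the caller backs up
```

`segment("themendinetonight", WORDS)` rejects "theme" and "them" before settling on "the", then rejects "mend" before settling on "men". Without memoization the search can be exponential in the worst case, which is acceptable for a short run of text between punctuation marks.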
