KATHOLIEKE UNIVERSITEIT LEUVEN
FACULTEIT INGENIEURSWETENSCHAPPEN
DEPARTEMENT COMPUTERWETENSCHAPPEN
AFDELING INFORMATICA
Celestijnenlaan 200 A — B-3001 Leuven

INFORMATION EXTRACTION FROM WEB PAGES BASED ON TREE AUTOMATA INDUCTION

Dissertation presented to obtain the degree of Doctor in Engineering Sciences

by

Stefan RAEYMAEKERS

Supervisors:
Prof. Dr. ir. M. BRUYNOOGHE
Prof. Dr. J. VAN DEN BUSSCHE

January 2008





Jury:
Prof. Dr. ir. D. Vandermeulen, chair
Prof. Dr. ir. M. Bruynooghe, supervisor
Prof. Dr. J. Van den Bussche, supervisor (Universiteit Hasselt)
Prof. Dr. ir. H. Blockeel, assessor
Prof. Dr. ir. E. Duval, assessor
Prof. Dr. M. F. Moens, assessor
Prof. Dr. L. De Raedt
Prof. Dr. ir. S. Flesca (Università della Calabria)
Dr. Joachim Niehren (INRIA, Lille, France)


U.D.C. 681.3*I26


© Katholieke Universiteit Leuven – Faculteit Ingenieurswetenschappen
Arenbergkasteel, B-3001 Heverlee (Belgium)

Alle rechten voorbehouden. Niets uit deze uitgave mag worden vermenigvuldigd en/of openbaar gemaakt worden door middel van druk, fotocopie, microfilm, elektronisch of op welke andere wijze ook zonder voorafgaande schriftelijke toestemming van de uitgever.

All rights reserved. No part of the publication may be reproduced in any form by print, photoprint, microfilm or any other means without written permission from the publisher.

D/2008/7515/7
ISBN 978-90-5682-898-1


Abstract

The World Wide Web is an invaluable source of information. Unfortunately, while this information is easily interpreted by a human reader, it is not straightforward to extract and process relevant data using computer programs. The aim in ‘Information Extraction from web pages’ is to learn how to extract specific information from structured text, based on a number of examples, where an example consists of a web page with one of the occurrences of the target data indicated. Alternatively, the term ‘wrapper induction’ is often used, where a wrapper stands for a procedure to extract data from a web page.

In short, we present in this thesis a general way to represent wrappers with tree automata, and we develop a specific technique for inducing wrappers in this format that, in our experiments, outperforms other related state-of-the-art techniques.

To this end we introduce and discuss an improved representation for tree automata, and we elaborate on the existence, uniqueness and construction of minimal and deterministic automata. An approach to using tree automata as wrappers is presented and worked out until it is fit for practical use. This includes an efficient algorithm to perform extraction.

We introduce a novel algorithm for the induction of tree languages in general, not specific to wrappers. This algorithm is capable of learning from positive examples only, because it learns within a subclass of the regular tree languages that is learnable from positive examples alone, in contrast to the whole class of regular tree languages. We adapt this induction algorithm for wrapper induction and extend it to a practical system, which includes choosing parameters, incremental annotation of examples, and a graphical interface.


Word of Thanks

So the moment has come: writing the final bit of text. Why wait so long? After all, this word of thanks will end up right at the front. Not because I wanted to save the best for last. Honestly, I had secretly been dreading having to write it for a while. How would you feel, faced with the near-impossible task of adequately thanking everyone for the incredible support and friendship I experienced while writing all the pages that follow? But now that I have started this task and all the memories come flooding back, I take it up with pleasure, or rather, I take up this perfect opportunity to thank you all. And if my gratitude is not always expressed exuberantly, those who know me a little know that it is certainly sincere.

As far as this doctorate is concerned, I want to begin by thanking Maurice and Hendrik, who, after my first research project, gave me the chance to join the Machine Learning research group. I thank my supervisors Maurice and Jan, who suggested I develop this topic into a doctorate. Especially Maurice, who took on the ‘daily supervision’ and still gave me the freedom and trust when I wanted to work through a number of topics quietly on my own. I have also always greatly appreciated that, as a deadline approached, he always found the time to carefully proofread or rewrite my texts, even though, in the stress of the deadline, I may not always have shown it clearly.

I thank the members of my jury for agreeing to sit on my jury, but certainly also because they clearly fulfilled that task conscientiously and provided me with many valuable comments, which have undoubtedly benefited the quality of the final text.

I also want to thank Harry. Not only because he proofread a first rough version of this text for language errors (any remaining errors are without doubt due to my, without false modesty, great talent for keeping them hidden from him), but also for the many lunches over the past years. Lunches where we could talk freely about everything except work (or perhaps a little work after all?), with wise advice included in the menu.

And then there are the many colleagues from whom I have learned a great deal over the past years. Even more important, though, was the pleasant working environment they created. Without the necessary relaxation, during lunch breaks or late conference nights, my inspiration would soon have dried up. Among those colleagues are, of course, the people who lived through every working day intensely with me all those years: my office mates. Anneleen and Celine, who also accompanied me to my first international conference. Little did we know then that this was an experience no other conference would ever match. Later, Joaquin and Werner joined us. Through thick and thin they managed to keep a very pleasant atmosphere in our office. The ‘thin’ certainly included my countless questions, which they helpfully answered, and my complaining, which they stoically endured whenever I was once again working against a deadline.

I also want to put in the spotlight the friends who give me at least one unforgettable evening every week. Thank you Veerle, Wim, Dieter, Raf, Nancy, Nico and Siska, Luk and Hilary, Bart and Sandra. Not only for the past 10 years, but certainly also for the many years still to come. Now that the last lines of this doctorate are almost written, I have time again to throw myself into it fully. I also want to thank Kris for his many years of friendship and for being there at the moments when I badly needed a listening ear.

Finally, I want to pause for a moment for someone I can no longer thank for everything she did for me, but who, holding this little book in her hands, would have been so proud, as only a mother can be.


Acknowledgements

We wish to express our gratitude to Ion Muslea for his feedback on our questions regarding the STALKER system and the Co-Testing approach.

We acknowledge the Fondazione Bruno Kessler for granting us the use of their TIES software, containing an implementation of the BWI algorithm.


Contents

1 Introduction
  1.1 Information Extraction from Structured Text
  1.2 Wrapper Induction
  1.3 Contributions
  1.4 Overview of the Text

2 Information Extraction from Web Pages
  2.1 Web Pages and HTML
    2.1.1 Views
  2.2 Information Extraction Task
    2.2.1 Extraction Task Examples
      2.2.1.1 Student List
      2.2.1.2 Paper Database
      2.2.1.3 Restaurant Guide
    2.2.2 Domain
    2.2.3 Single Field Extraction versus Tuple Extraction
    2.2.4 Node Extraction versus Sub Node Extraction
    2.2.5 Element Extraction versus Value Extraction
  2.3 Data Sets
  2.4 Evaluation Metrics

3 Automata
  3.1 Alphabets, Strings and Trees
  3.2 String Automata
    3.2.1 Definitions
    3.2.2 String Automata Operations
      3.2.2.1 A General Construction Algorithm
      3.2.2.2 Copy and Negation
      3.2.2.3 Union and Intersection
      3.2.2.4 Concatenation
      3.2.2.5 Iteration
    3.2.3 Minimization of String Automata
      3.2.3.1 Existence of a Minimal Set of String States
      3.2.3.2 State Minimization
      3.2.3.3 Input Minimization
    3.2.4 Determinization of String Automata
  3.3 Tree Automata
    3.3.1 Definitions
    3.3.2 Representation of the Transition Function
    3.3.3 Alternative Representations
    3.3.4 Tree Automata Operations
      3.3.4.1 A General Construction Algorithm
      3.3.4.2 Copy and Negation
      3.3.4.3 Union and Intersection
    3.3.5 Minimization of Tree Automata
      3.3.5.1 Existence of a Minimal Set of Tree States
      3.3.5.2 Equivalence Properties of the Transition Function
      3.3.5.3 Minimization of Tree Automata
      3.3.5.4 Comparison and Experimental Evaluation
    3.3.6 Determinization of Tree Automata
  3.4 Summary

4 Information Extraction with Automata
  4.1 Marked Documents
  4.2 Representing Wrappers
    4.2.1 Correct Markings
    4.2.2 Extraction
    4.2.3 Conversion from CCM to PCM
    4.2.4 Conversion from PCM to CCM
    4.2.5 Conversion from SCM to PCM
  4.3 Combining Wrappers
    4.3.1 Overlap between Extraction Tasks
  4.4 Extraction in a Single Run
    4.4.1 Keeping Track of Extractions
    4.4.2 Single Run Extraction
    4.4.3 Complexity
    4.4.4 Experiments
  4.5 Summary

5 (k, l)-Contextual Tree Languages
  5.1 k-Contextual String Languages
    5.1.1 Definitions
    5.1.2 Generalization Power versus Expressiveness
  5.2 (k, l)-Contextual Tree Languages
  5.3 Learning (k, l)-Contextual Tree Languages
  5.4 Learning (k, l)-Contextual Tree Acceptors
    5.4.1 Fork Set Acceptor
      5.4.1.1 Incremental Construction
      5.4.1.2 Constructing Directly from a Tree
    5.4.2 Conversion to (k, l)-Contextual Tree Acceptor
      5.4.2.1 The Conversion Algorithm
      5.4.2.2 More Optimal Representation
  5.5 Summary

6 Wrapper Induction with (k, l)-Contextual Tree Languages
  6.1 Information Extraction with (k, l)-Contextual Tree Languages
    6.1.1 Practical Wrapper Induction
    6.1.2 Learning Wrappers as Automata
  6.2 Parameter Estimation
    6.2.1 Implementation
  6.3 Learning the Parameters
    6.3.1 Algorithm
    6.3.2 Learning with Context
  6.4 Induction with Equivalence Queries
    6.4.1 Interactive Algorithm
    6.4.2 Implementation
  6.5 Summary

7 Related Work and Experimental Comparison
  7.1 String Based Methods
    7.1.1 STALKER
      7.1.1.1 STALKER with Co-Testing
    7.1.2 BWI
  7.2 Tree Based Methods
    7.2.1 The Local Unranked Tree Inference Algorithm
    7.2.2 SQUIRREL
  7.3 Experiments
    7.3.1 Positive Examples Only Approaches
    7.3.2 Interactive Approaches
  7.4 Summary

8 Hybrid Approach
  8.1 Occurrences of Sub Node Fields
  8.2 Possible Approaches
  8.3 Interactive System
  8.4 Experiments
  8.5 Related Work
  8.6 Summary

9 Conclusions and Further Work
  9.1 Conclusions
  9.2 Further Work
    9.2.1 Tree Automata Optimization
    9.2.2 Extensions to (k, l)-Contextual Tree Languages
    9.2.3 Wrapper Extensions

References

Publication List

Biography


List of Figures

1.1 The structure of the text

2.1 A small HTML example
2.2 Tree view of a small HTML document
2.3 The ‘Student List’ example
2.4 The ‘Paper Database’ example
2.5 The ‘Restaurant Guide’ example
2.6 Illustration of node and sub node fields

3.1 Graphical representation of trees
3.2 Example of a finite state automaton
3.3 The run of an FSA on a string
3.4 Incomplete and complete FSA / unreachable states
3.5 Influence of input alphabet on negation
3.6 Illustration of the union operator
3.7 Illustration of the concatenation operator
3.8 Illustration of the iteration operator
3.9 Illustration of FSA determinization
3.10 Processing a tree by a tree automaton
3.11 Complete and incomplete FTA
3.12 An example FTA with illustration of bottom-up run
3.13 First alternative representation for transition function
3.14 Second alternative representation for transition function
3.15 A stepwise tree automaton
3.16 Illustration of the union operator for tree automata
3.17 Examples of FTAs in different representations
3.18 Dependency graphs for refinement operators
3.19 A partial run of the minimization algorithm
3.20 Minimal automata in different representations
3.21 Illustration of FTA determinization

4.1 Correct marking acceptors (string)
4.2 Correct marking acceptors (tree)
4.3 EM and CCM accepting markings outside domain
4.4 Conversion from CCM acceptor to PCM acceptor (string)
4.5 Conversion from CCM acceptor to PCM acceptor (tree)
4.6 Conversion from PCM acceptor to CCM acceptor (string)
4.7 Conversion from PCM acceptor to CCM acceptor (tree)
4.8 Conversion from ESCM acceptor to PCM acceptor (string)
4.9 Conversion from ESCM acceptor to PCM acceptor (tree)
4.10 Schema to combine two wrappers
4.11 CCM acceptors for different extraction tasks on same domain
4.12 A combined CCM acceptor for two different extraction tasks
4.13 A CCM acceptor and an extracting run on a string
4.14 An extracting run on a tree
4.15 A single run extraction on a string
4.16 A single run extraction on a tree
4.17 Illustration of the worst case for single run extraction
4.18 Graph showing extraction time versus document size

5.1 Forks of a tree
5.2 Another example of a set of forks of a tree
5.3 Fork Set Acceptor
5.4 Minimal Fork Set Acceptor
5.5 A small part of the (k, l)-contextual tree acceptor for Example 5.10
5.6 (k, l)-contextual tree acceptor for Example 5.10
5.7 Examples of Finite Tree Acceptors
5.8 Examples of Finite Tree Acceptors

6.1 HTML examples with generalized leafs
6.2 Parameter Space and Data Representation
6.3 Use case for GUI

7.1 Hierarchy of wildcards
7.2 Examples of embedded catalogs
7.3 Example transformations for LUTI

8.1 Possible configurations of spanning nodes and target fields

9.1 Examples of k-clumps of graphs


List of Tables

3.1 Transition Function of FSA A in Example 3.4

4.1 Timings on RISE data sets
4.2 Single field versus multiple-field extraction

6.1 Number of extractions for okra-1
6.2 Number of extractions for bigbook-3

7.1 Experimental results comparing ‘pos. examples only’ approaches
7.2 Experimental results comparing interactive approaches

8.1 Number of interactions needed to learn a perfect wrapper


List of Algorithms

3.1 General Algorithm for String Automata Construction
3.2 Function getCompositeTransition for the determinization operator
3.3 General Algorithm for Tree Automata Construction
3.4 General Automaton Minimization
3.5 Function getCompositeTransition for the determinization operator
4.1 Function getCompositeTransition for the CP operator (string)
4.2 Function getInitialComposite for the CP operator (tree)
4.3 Function getCompositeTransition for the PC operator (string)
4.4 Function getCompositeTransition for the PC operator (tree)
4.5 Function getCompositeTransition for the ESP operator (string)
4.6 Function getCompositeTransition for the ESP operator (tree)
5.1 learnWrapper
5.2 GetForks
5.3 Function getInitialComposite for conversion to an acceptor
5.4 Function getCompositeOutputS for conversion to an acceptor
5.5 Auxiliary Function addToSet
5.6 Function getCompositeOutputT for conversion to an acceptor
5.7 Function getCompositeTransition for conversion to an acceptor
6.1 Parameter Estimation
6.2 Learning the Parameters


Chapter 1

Introduction

Many volumes can be written, and have been written, about what is called the information age. In the course of a few decades, computers have gone from awe-inspiring mastodons under the care of dozens of technicians and researchers to everyday appliances. Experimental connections between university computers have grown into a worldwide interconnection of networks, commonly known as the Internet. For most laymen, this term is synonymous with the World Wide Web, the most successful, or at least most visible, application based on the Internet.

Although radio and television started the globalization of information, bringing the world into the living room, it is the advent of the Web that allowed everybody to share information from their living room with the rest of the world. The minimal effort needed to publish a web page resulted in a tremendous growth of the Web. A study from 2005 (Gulli and Signorini 2005) estimated the size of the Web at over 11.5 billion pages. This organic growth was not anticipated at the conception of the protocols for the Web, and no provisions were made to enable easy retrieval and processing of these vast information resources.

The huge and ever increasing amounts of data are prohibitive for systematic manual processing. This provided a new motivation for several research domains, like Information Retrieval, which aims to automate the retrieval of pages relevant to a given query (with, as an example application, the automatic creation of indexes by search engines), and Information Extraction, which aims to automate the extraction of data from web pages into a form that facilitates its automatic processing.

This work aims to improve the state of the art in wrapper induction for information extraction from web pages. In this chapter we introduce the concepts ‘information extraction from structured text’ and ‘wrapper induction’. We also list the main contributions of this work and sketch the structure of the text.


1.1 Information Extraction from Structured Text

The aim of information extraction (IE) systems is to extract specific information from a set of human-readable documents into a form suitable for computer manipulation. This is a challenging field. For some tasks the data can be identified based on a simple pattern surrounding it; for other tasks, a deep understanding of the text might be necessary. We distinguish two types of IE.

On the one hand, we have information extraction from unstructured (or free) text. The data is embedded in full sentences within a continuous text. Hence, several techniques rooted in the field of natural language processing can be used. Often these are used in a preprocessing step to derive additional information for the elements of a sentence. Part-of-speech tagging, for example, is used to attach the probable type of each word (verb, noun, adverb, . . . ), while word sense disambiguation can be used to retrieve the semantic meaning of the words.

On the other hand, we have information extraction from structured text, like web pages. Web pages are written in the Hypertext Markup Language (HTML). In this format, markup tags are used to indicate the document's structure (title, paragraphs, bulleted lists, . . . ) or its appearance (text color, width, . . . ). Traditional techniques used for information extraction from unstructured text often do not work for structured texts, as the data is often not presented in full sentences, but in tables, lists, . . . , where no semantic relation between parts of the document can be recognized.

A web page often contains multiple instances of similar data, or a site contains several pages with similar data. Moreover, this similar data is presented (structurally) in a similar way. This is especially true when the pages are generated by a script. Scripts are used to automatically generate a web page based on, for example, the results of a query on a database. Therefore, rules based on the structural context can be found that apply specifically to this similar data. This makes it possible to extract the data, despite the lack of embedding sentences.
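As a purely illustrative sketch of such a structural rule (the page and the rule below are invented for this example and do not come from the thesis or its data sets), one extraction rule based on tag context can pick out every occurrence of a field in a script-generated table:

```python
# Minimal sketch: extracting data by structural context. One rule,
# "the second <td> of each <tr>", covers all similar occurrences,
# even though the cells share no embedding sentences.
import xml.etree.ElementTree as ET

PAGE = """<html><body><table>
<tr><td>Pizzeria Roma</td><td>Italian</td></tr>
<tr><td>Le Bistro</td><td>French</td></tr>
</table></body></html>"""

tree = ET.fromstring(PAGE)
# Apply the structural rule to every table row.
cuisines = [row[1].text for row in tree.iter("tr")]
print(cuisines)  # ['Italian', 'French']
```

Real wrappers must of course cope with less regular markup than this toy page; that is exactly what makes their automatic induction worthwhile.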

Note that structured text like HTML pages is often referred to as semi-structured data. Semi-structured data can be described as data with an internal structure, but such that the type/field of the separate elements is unspecified, whereas for structured data it is always clear to what type an element belongs. For example, a database file consists of records, and for every element of a record the type and semantics are specified (even though perhaps in an external specification). In contrast, HTML pages may contain elements with different semantics presented in the same way.
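The contrast can be made concrete with a small illustration (the data is invented for this example): in the structured form every value carries an explicit field name, while in the semi-structured HTML form the same values appear as three look-alike cells with no type information.

```python
# Structured: each element's field/type is explicit.
record = {"name": "Le Bistro", "cuisine": "French", "phone": "016-123456"}

# Semi-structured: internal structure (a row of cells), but the
# field of each cell is unspecified; all three <td> elements look alike.
html_row = "<tr><td>Le Bistro</td><td>French</td><td>016-123456</td></tr>"
```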

1.2 Wrapper Induction

The term wrapper is generally used to denote a customized procedure to extract specific data from a web page. Typically a wrapper is defined for a limited set of web pages that originate from the same script. For pages containing similar data from a different site, another wrapper is needed. Also, when a site changes its look, or even slightly changes its generating scripts, it is often necessary to create new wrappers. The manual creation of wrappers is a tedious and error-prone process, despite various tools designed to facilitate wrapper building. This drives the research interest in the automatic learning of such wrappers.

We distinguish between unsupervised and supervised learning. Unsupervised learning means that unclassified examples are provided and that the algorithm tries to distinguish classes itself, based on regularities in the data. For wrapper induction, web pages are given without any extra information, and the algorithm tries to guess the interesting elements and their classification.

In supervised learning, the algorithm receives data together with a classification for this data. This is provided by an oracle, an expert in the domain of the learning task at hand. In the case of wrapper induction, an oracle can be any person with some experience with web pages. Each web page contains multiple occurrences of the target field. We consider as a single positive example a page with one of these occurrences marked as the target field. Other examples can come from different pages, but also from the same page, with different occurrences marked. A negative example could be a page with an element marked that is not a target element. Typically, the number of possible negative examples that can be generated from a page is much larger than the number of positive examples, and most of the negative examples will not add any useful information for the induction algorithm. Therefore, most wrapper learning approaches are based on positive examples only. After the learner has inferred a hypothesis, it can be used for extraction. False positives are sensible candidates for negative examples and can be exploited to refine the hypothesis. This scheme is used in some interactive systems.

Some approaches expect as a positive example a completely annotated page. When a page is completely annotated, i.e., all positive examples for that page are given, we in fact also have an implicit collection of negative examples. Indeed, every node that is not marked in one of the positive examples is negative. It is tedious, though, to annotate pages completely. A more comprehensive study of different approaches to wrapper induction can be found in (Kosala 2003).

Most approaches have their own way to represent wrappers. These representations vary in expressiveness, from first order logic to a fixed string pattern containing wildcards. Given that learning becomes harder for a more expressive language, we want to make a trade-off between expressiveness and learnability. Experiments seem to indicate that for practical use the class of regular string languages, or even some of its subclasses, is sufficiently expressive. We decided to use finite state automata, which can represent any regular string language, to represent wrappers, because they are a well-known technique and have nice theoretical properties, like decidability of emptiness, equivalence, and the halting problem. Next to string automata, we also use tree automata, an extension able to represent regular tree languages and therefore more expressive. As string automata are slightly more intuitive, we will often explain a new algorithm first for string automata, and use this as a stepping stone for the same algorithm on tree automata.

1.3 Contributions

The aim of this work is to research techniques that allow a practically usable system to extract desired information from web pages, and that enable it to learn to do this based on examples given by a non-expert user. To this aim, we developed some general tools, and we worked out their application to information extraction from web pages. Below, we summarize our main contributions. More detail on our contributions can be found at the end of each chapter.

• We propose a new representation for tree automata, with several advantages over previous representations. We present a general framework that allows defining operations on these automata, and we discuss their minimization and determinization.

• We show how automata can be used to represent wrappers for information extraction. We have developed techniques for efficient extraction, and the necessary conversion operators to handle the wrappers.

• We conceived a technique, based on the class of (k, l)-contextual tree languages, that is able to learn a tree language from positive examples only, and we show how to build a tree automaton that accepts a (k, l)-contextual tree language.

• We present a practical approach to learning wrappers that is based on (k, l)-contextual tree languages. This includes learning the parameters, and the implementation of an interactive system.

1.4 Overview of the Text

We start with a synopsis of information extraction from web pages in Chapter 2. There we pinpoint the exact interpretation of information extraction we will use in this thesis. The text then continues with two chapters about the representation of wrappers. In Chapter 3, we introduce string and tree automata and elaborate on their properties, while in Chapter 4, we elaborate upon how to use general automata for the representation of wrappers. Subsequently, we tackle the problem of inducing wrappers. We start with the general learning of tree automata in Chapter 5, where we introduce the class of (k, l)-contextual tree languages, to learn from positive examples only. The application of the class of (k, l)-contextual tree languages to wrapper induction is described in Chapter 6. This chapter further explains the practical issues of an interactive system for wrapper induction. In Chapter 7 we go into some detail on related work, in order to discuss the merits of each approach. We report on an experimental comparison, including these approaches, in the same chapter. In Chapter 8, we define and evaluate an extension to our system that allows us to perform sub node extraction. In the final chapter, Chapter 9, we summarize the conclusions of our work and suggest some avenues for further work.

We visualize the structure of the text in Figure 1.1.


[Figure 1.1 shows the structure of the text as a grid with columns 'General Techniques' and 'Information Extraction' and rows 'Representation', 'Learning', 'Comparison', and 'Extension': Chapter 2 (Definition of I.E.); Chapter 3 (Automata); Chapter 4 (I.E. with Automata); Chapter 5 ((k, l)-Contextual Languages); Chapter 6 (I.E. with (k, l)-Contextual Languages); Chapter 7 (Related Work); Chapter 8 (Sub Node Extraction); Chapter 9 (Conclusions and Further Work).]

Figure 1.1: The structure of the text


Chapter 2

Information Extraction from Web Pages

In this chapter we introduce the different concepts in the domain of information extraction from web pages. We provide some running examples, and we introduce standard data sets for IE, which we will use for experiments later on.

2.1 Web Pages and HTML

Web pages are coded in a language called HyperText Markup Language (HTML). A markup language is used to add extra information to parts of a text. In HTML this is done by embedding special labels, called tags, into the text. To set these labels apart from the pieces of regular text, they are surrounded by less-than (<) and greater-than (>) signs. To attach extra information to a part of the document, a starting tag is put before that part, and an associated end tag is put after that part. An end tag for a given tag consists of that tag prefixed with a slash ('/'). To insert information in between two parts of the text, a single tag without end tag is placed between those two parts. Note that a tagged part of the document (a part between a matching start and end tag) can contain other tags, apart from regular text. This means that the tagged part can contain tagged subparts, and hence form a nested hierarchy.

The extra information added by HTML concerns the structure or appearance of the text; it also allows, amongst others, the use of interactive forms, the embedding of special objects, like figures, into the text, and the use of hyperlinks, which allow jumping from one page to another (hence hypertext). Some tags can also contain extra attributes. These are referred to by a name and are given a value. To insert a figure, for example, the name of the file in which the figure is stored has to be passed, and when a link is defined, a reference to the page to be jumped to has to be provided. In the first case the name of the attribute is src, in the second case it is href. The extra information encoded by HTML is processed by a web browser, an application that visualizes the text alone, taking into account the layout directives specified in the tags. This way, the HTML tags are hidden in normal use.

a) <html><body>
   <h1>Example</h1>
   <a href="p2.html">An <b>other</b> page</a>
   </body></html>

b) [screen shot not reproduced]

Figure 2.1: The HTML code (a) and a screen shot (b) of the small HTML example from Example 2.1.

Example 2.1 We give a small example of an HTML document. This document consists of a header containing the text 'Example', and a link to another page. This link contains the text 'An other page', in which the word 'other' is accentuated in bold, and links to a page in a file called 'p2.html'.

In Figure 2.1.a we show the HTML code of this document. The header is enclosed between the tags <h1> and </h1>. The link is enclosed between the tags <a> and </a>. Note that for the link, the attribute 'href' is given the value 'p2.html'. The data tagged as a link itself contains a tagged part: around the text 'other' the tags <b> and </b> are placed, to indicate that it should be visualized in bold. The visualization of this document in a browser is shown in Figure 2.1.b.

We refrain from giving an exhaustive listing of all HTML tags or a rigorous explanation of the specifics of HTML, as the techniques in this work do not make any use of the semantics of the separate tags. The different tags are currently used as a set of identifiers without any special meaning. Future extensions might integrate some semantics, at the cost of being less general. An interesting direction could be to treat the table-related tags separately, and devise a specialized extraction method that takes into account the layout in rows and (especially) columns of a table.

html
└─ body
   ├─ h1
   │  └─ 'Example'
   └─ a
      ├─ 'An'
      ├─ b
      │  └─ 'other'
      └─ 'page'

Figure 2.2: The tree depicting the tree structure of the document from Example 2.1

2.1.1 Views

Technically, an HTML document is a long string of characters. For practical purposes, though, a higher-level view is advisable. We list below some commonly used views on HTML documents.

String of Characters Some general applications or techniques are not aware of HTML, and therefore the raw view on the document is sometimes used, although rarely for information extraction from web pages.

Token Sequence An HTML-aware token parser will split the raw string of characters into separate, meaningful pieces (tokens). These tokens are either HTML tags (possibly split up further, to put the attributes into separate tokens as well), or words in the text (characters between whitespace or special characters). The parser assigns to each token a type and possibly extra information.

In the domain of information extraction from web pages, this view is used very often, as it allows the use of well-known techniques for sequences. These techniques do not have to be aware of HTML, as the tag tokens can be handled as just another token type.
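A minimal sketch of such a tokenization, using Python's standard `html.parser` module. The token type names ("start_tag", "end_tag", "word") are our own illustrative choice, not a convention from this text.

```python
# Sketch: splitting an HTML string into a token sequence.
from html.parser import HTMLParser

class Tokenizer(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tokens = []          # the resulting token sequence

    def handle_starttag(self, tag, attrs):
        self.tokens.append(("start_tag", tag))

    def handle_endtag(self, tag):
        self.tokens.append(("end_tag", tag))

    def handle_data(self, data):
        # Split text between tags into word tokens on whitespace.
        for word in data.split():
            self.tokens.append(("word", word))

t = Tokenizer()
t.feed("<h1>Example</h1><a href='p2.html'>An <b>other</b> page</a>")
print(t.tokens)
```

The tag tokens carry no special semantics here; a sequence-based learner can treat them exactly like the word tokens, as the text describes.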

Tree View HTML code is inherently tree structured. A part of the text that is tagged (enclosed in a start and end tag) can contain tagged subparts, and so on. Hence each tagged part corresponds to a node in a tree, and the subparts of that part are child nodes of that node. The text between two HTML tags is commonly interpreted as a single text node. A text node is always a leaf. Figure 2.2 depicts a tree view of the small HTML document from Example 2.1.
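The nesting described above can be made concrete with a small stack-based parser, again using Python's standard `html.parser`. The nested-tuple representation `(tag, children)` is an illustrative simplification of a real DOM tree.

```python
# Sketch: building a simple tree view of an HTML document.
# Each node is a pair (label, children); text nodes are ("text", value) leaves.
from html.parser import HTMLParser

class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = ("root", [])
        self.stack = [self.root]  # path from the root to the open node

    def handle_starttag(self, tag, attrs):
        node = (tag, [])
        self.stack[-1][1].append(node)   # attach to the current parent
        self.stack.append(node)          # descend into the new node

    def handle_endtag(self, tag):
        self.stack.pop()                 # close the deepest open node

    def handle_data(self, data):
        if data.strip():                 # ignore pure whitespace
            self.stack[-1][1].append(("text", data.strip()))

b = TreeBuilder()
b.feed("<html><body><h1>Example</h1>"
       "<a href='p2.html'>An <b>other</b> page</a></body></html>")
tree = b.root[1][0]   # the 'html' node, mirroring Figure 2.2
```

Running this on the document of Example 2.1 yields the same hierarchy as Figure 2.2: `html` above `body`, which has the `h1` and `a` subtrees with their text leaves.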


The Document Object Model (DOM) is a standard1 that is often used to model the tree structure of an HTML document. This standard provides extra functionality for tree handling, and allows attaching extra information to the nodes. The attributes of each tag are also available on the nodes of the tree. Note that although DOM is standardized, different parsers will not necessarily yield exactly the same DOM tree, especially for erroneous documents. Historically, web browsers have fault tolerance built in, whereby the HTML code gets fixed before visualization. Often various ways to fix a page exist.

The approach presented in this work uses a tree view on the documents.

Rendered View The rendered view is the visualization of an HTML document as it is presented to the user in a web browser. The different elements each have their own position and characteristics in the layout of the document. This is information that is not readily available in a DOM tree.

For information extraction applications that use the layout information of the elements, the rendered image itself is not practical. An intermediate model is used that associates layout information with each element in a DOM tree. A candidate model is the box model as specified in the Cascading Style Sheets (CSS) standard. The CSS standard2 is the most prominent method to specify layout directives for elements of an HTML document. Its box model defines one or more boxes for each element in the DOM tree. To these boxes, the calculated rendering information is attached.

2.2 Information Extraction Task

The specification of extraction tasks varies widely, depending on the type of data that is offered for processing, the kind of information needed for a specific application, and which postprocessing steps are planned for the information. For each extraction task, we can discern one or more fields of interest, often in a mutual relation. The elements in a document can then be classified according to the field they belong to. Below, we introduce some example tasks, and we discuss characteristics of extraction tasks and different types of extraction tasks.

2.2.1 Extraction Task Examples

We introduce some fictive examples of extraction tasks. The purpose of these examples is not only to illustrate what an information extraction task could be, but also to depict the points discussed in this section, and especially to function as running examples for subsequent chapters. Hence they are fictive, to allow for small and simple examples, such that they, and derived wrappers or data, can be shown in the limited space of a single page.

1 The specification of the DOM standard can be found at www.w3.org/DOM/.
2 The specification of the CSS standard can be found at www.w3.org/Style/CSS/.

These examples are relevant in the sense that they resemble real-world tasks, but they are simplified in several ways. They lack advertisements, or additional layout elements that would make them more fancy. They have a limited number of different fields. For example, in a real-world setting, more information is kept than only which supervisor is assigned to a student, as in the example of Section 2.2.1.1. Also, the number of values per field is kept small.

2.2.1.1 Student List

Assume a university database containing, amongst others, the data on the PhD students in all research groups of the university. To simplify the student administration, the technical support staff has provided an internal web interface that allows posing queries like 'Give for research group X the list of students'. The web server passes these queries automatically to the database. The result of such a query is then fed to a script, a small program that automatically generates an HTML document to show the list of students, with for each student the data relevant for the research group. In our example, this data consists of only the supervisor of that student. This document is given back by the web server to the browser from which the original query originated, and the browser visualizes the answer. A possible resulting HTML document is shown in Figure 2.3.a, while a screenshot of the visualization in a browser is shown in Figure 2.3.b. For this example document, the field 'student' contains the values 'Stefan' and 'Anneleen', and the field 'supervisor' contains the values 'Maurice' and 'Hendrik'.

A sketch of a situation in which information extraction from web pages is a solution: suppose an administrator urgently needs a distribution key for some funding of the professors in our research groups. A constraint is that this key definitely has to take into account the number of PhD students assigned to each professor. Unfortunately, this query is not implemented in the web interface, and the support staff currently has other priorities than to tinker with an old script (and granting direct access to the database would be a huge breach of privacy, as it contains a lot of sensitive data). An easy solution is to process the data that is available through the website. We devise a wrapper that extracts the field 'supervisor', and in the extracted list we can then count the occurrences of each professor to know how many PhD students he has.
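A sketch of this workaround on the Student List page of Figure 2.3. The hand-written pattern ("the bold text following a 'supervisor:' label") is an illustrative stand-in for a learned wrapper, not the induction technique of this thesis.

```python
# Sketch: extract the 'supervisor' field from a Student List page and
# count the occurrences of each professor.
import re
from collections import Counter

# The HTML of Figure 2.3 (Student List example).
page = ("<html><body><ul><li>name: <b>Stefan</b></li>"
        "<li>supervisor: <b>Maurice</b></li></ul><hr>"
        "<ul><li>name: <b>Anneleen</b></li>"
        "<li>supervisor: <b>Hendrik</b></li></ul><hr></body></html>")

# Hand-written stand-in for a wrapper for the 'supervisor' field.
supervisors = re.findall(r"supervisor: <b>([^<]+)</b>", page)

# Count PhD students per professor.
counts = Counter(supervisors)
print(counts)
```

On this page each professor supervises one student; run over all Student List pages, the counter would give the distribution key the administrator needs.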

2.2.1.2 Paper Database

A database of articles can be queried for articles containing a given search term. The query returns a list that contains for each selected article its title and author. This result is converted to HTML by a script and visualized in a browser. In the document, each listed title contains a link to the actual article with that name,


a) <html><body><ul><li>name: <b>Stefan</b></li>
   <li>supervisor: <b>Maurice</b></li></ul><hr>
   <ul><li>name: <b>Anneleen</b></li>
   <li>supervisor: <b>Hendrik</b></li></ul><hr>
   </body></html>

b) [screen shot not reproduced]

Figure 2.3: Student List: HTML code (a) and screen shot (b).


a) <html><body>
   <h4>Search results for <i>search term</i></h4>
   <p><b><a>title1</a></b> <a>author1</a></p>
   <p><b><a>title2</a></b> <a>author2</a></p>
   <p><b><a>title3</a></b> <a>author3</a></p>
   <center>
   <b><a>Prev</a></b> <a>1</a> 2 <a>3</a> <b><a>Next</a></b>
   </center>
   </body></html>

b) [screen shot not reproduced]

Figure 2.4: Paper Database: HTML code (a) and screen shot (b).

and each author occurrence contains a link to some biography information for that author. Long lists are split over multiple documents. A schematic example web page, together with the corresponding HTML code, is shown in Figure 2.4. The query returned a lot of results and is split over three pages, of which the page shown is the second. Fields that can be discerned in this example are 'search term', 'author', and 'title'.

2.2.1.3 Restaurant Guide

A web site running a restaurant guide allows searching for restaurants based on parts of their name. The resulting list of restaurants is returned as a web page constructed from a fixed template. In Figure 2.5, a possible outcome is shown for a search on 'china'. From this web page we can extract the following fields: the name of the restaurant (N), its type (T), the city (C) where it is located, and a phone number (P). For each restaurant we could also extract the URL (L) from the link (leading to more detailed address information). From the top sentence, the search term (S) that generated the page can also be found. Note that the occurrence of the search term in the name is rendered in italics, while the country code of the phone number is in bold.

a) <html><body>
   <b>Restaurant Guide: search results for <i>china</i></b>
   <p><a>New<i>China</i>Town (chinese)</a>
   Brussels<b>Tel: +32</b>(0)2 345 67 89</p>
   <p><a>Royal<i>China</i>(chinese)</a>
   Leuven<b>Tel: +32</b>(0)16 61 61 61</p>
   <p><a><i>China</i>Garden (chinese)</a>
   Amsterdam<b>Tel: +31</b>(0)20-4321234</p>
   </body></html>

b) [screen shot not reproduced]

Figure 2.5: Restaurant Guide: HTML code (a) and screen shot (b).

2.2.2 Domain

An extraction task is typically defined on a set of documents where the information of interest is organized in a similar way. The set of relevant documents is called the domain of the extraction task. For example, a wrapper can be learned for extracting the types of the restaurants from the pages of the 'Restaurant Guide' website. It makes no sense to apply this wrapper to a document from the 'Student List' example. But even on pages with similar data, like those from another website featuring its own restaurant guide, the wrapper will not yield the expected results. Indeed, while the latter holds similar data, the internal representation can be completely different. For extraction tasks on websites with automatically generated pages, the domain is normally the set of documents that are generated with the same script.

It is not always clear what exactly the domain is without access to the generating script. We may have seen a fair sample of pages from the domain, but we are never sure to have seen all possible exceptions. It is possible that the search engine of the Restaurant Guide website does not find any matching restaurants. In that case a special page is shown, indicating that 'No results are found for the specified search term'. The wrapper should recognize this exception and return an empty list, instead of, for example, claiming that a restaurant exists with the name 'No results'. Or the search engine could, on the contrary, return a large number of restaurants. The script might then generate a page containing the extra message 'x results found, only first 50 results shown. Please refine your search term.'. If no provision for this exception is made (or learned) in the wrapper, it might get derailed and return an empty list. As another example, the Student List pages could contain a picture for each student. In case no picture is available for a student, an alternative format is used for the entry of that student.

Apart from exceptions in the presented data, the generated pages might differ in areas away from the fields of interest. Each page can, for example, contain a random advertisement at the top of the page. This way, the beginning of the document can be different for each page.

2.2.3 Single Field Extraction versus Tuple Extraction

A set of documents in a given domain normally has values for different fields. Often these fields are in some relation to each other, for example the 'is student of' relation between the 'student' and 'supervisor' fields. A single field extraction task extracts a list of values that belong to a specific field. When extracting different fields separately, though, the relation between the elements is lost in the different resulting lists. With a tuple extraction task we denote extraction tasks that extract multiple fields and keep track of the connections between the values of the different fields. For example, a requirement could be to extract pairs of a student and a supervisor from a page. Note that the relation between fields might result in more complex data structures than tuples.

Alternatively, a tuple extraction task might be split into several single field extraction tasks, and the mutual relations between the extracted values could then be extracted in a separate post-processing step. This possibility is discussed in more detail in Section 9.2.3.

The approach presented in this work, focuses on single field extraction.

2.2.4 Node Extraction versus Sub Node Extraction

In the tree view, all text between two HTML tags ends up in a single text node. The values of a field in a given extraction task, though, can correspond to a single node in the document tree, to a part of a single node, or can be spread over a number of adjacent nodes. In the latter case, the field may, but need not, correspond to all leaf nodes of a single subtree. We can discern node extraction tasks as those tasks for which the target values are always a node, and sub node extraction tasks as those tasks for which at least one boundary of the target values falls in the middle of a text node.

p
├─ a
│  ├─ 'New'
│  ├─ i
│  │  └─ 'China'
│  └─ 'Town (chinese)'
├─ 'Brussels'
├─ b
│  └─ 'Tel: +32'
└─ '(0)2 345 67 89'

(N)ame: 'New', 'China', 'Town'; (T)ype: '(chinese)'; (C)ity: 'Brussels'; (P)hone number: '+32', '(0)2 345 67 89'

Figure 2.6: A subtree of the document from Figure 2.5, containing the first restaurant. The different fields are indicated below the text leaves.

The extraction tasks in the 'Student List' and 'Paper Database' examples are all node extraction examples. The 'Restaurant Guide' example contains both types. In Figure 2.6 we show the tree (only a subtree, due to space restrictions) of the document from Figure 2.5, with the target fields indicated beneath. Only the 'city' field is extracted with a node extraction task, as it is the only one that occupies a single text node. For the other fields, sub node extraction is required. The 'type' field can be extracted from a single node, while the 'name' and 'phone number' fields are spread over multiple nodes. The 'search term' field in the 'Restaurant Guide' example (not present in Figure 2.6) is also an example of a node extraction task.

Note that a similar distinction could be made for token versus sub token extraction tasks, but as target values usually do not contain half a token, the use of characters instead of tokens is overkill. Indeed, the lower granularity would slow down the learning phase while also turning it into a more difficult task.

In this work, the focus is on node extraction tasks. In Chapter 8, the handling of sub node extraction tasks is discussed in more detail.

2.2.5 Element Extraction versus Value Extraction

A value associated with a certain field is actually a string representation of a piece of the document. No indication is left of where in the document the value originated. For most applications value extraction is sufficient. In some applications, though, it is necessary to know the position of the value for the field in the document (for example, to which node in the tree it belongs), or, especially in the case of sub node extraction, where the subsequence of tokens that makes up this value starts and ends in the whole sequence of tokens of the document. We use the term 'element extraction' when an element with its position in the document needs to be extracted, instead of only the value of that element. Typically, the result of element extraction is a list of either positions in the document, indices of tokens in the document, or pointers to nodes of the document tree (when using a tree view). The actual value at this position can then be extracted from the document, and the position of the extraction is known and can be used for further processing.

An example of an application requiring element extraction is one in which the values of a given field in a document have to be changed. A practical example could be that values given in dollars need to be changed into values in euros. To perform a correct substitution, we not only need to know the values that have to be replaced (the target values of the given field), but also their exact positions. If only the values are extracted, we could try to perform a general substitution within the document, substituting a new value for each occurrence of one of the extracted values. However, this latter approach might fail when the same value occurs not only as a target of the given field. For example, a page could list a product costing 20 dollars, available from a shop with house number 20. Replacing the value '20' everywhere in the document by the correct amount in euros would therefore result in a wrong house number.
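The dollars-to-euros pitfall can be made concrete in a few lines. The tiny document, the offsets, and the converted amount are illustrative assumptions; the point is the contrast between substituting a value and substituting at an extracted position.

```python
# Sketch: value extraction vs. element extraction for substitution.
doc = "Price: 20 dollars. Address: Main Street 20."

# Value extraction only: a blind substitution also hits the house number.
blind = doc.replace("20", "18")

# Element extraction: the wrapper returned the value *with* its position
# (character offsets 7..9 of the price value "20"), so only that
# occurrence is replaced.
start, end = 7, 9
correct = doc[:start] + "18" + doc[end:]

print(blind)    # house number corrupted
print(correct)  # house number intact
```

Only the position-aware replacement leaves 'Main Street 20' untouched, which is exactly why this work aims for element extraction.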

When a tuple extraction task is performed as separate single field extraction tasks, the post-processing step that constructs the relation information on top of the extractions also needs to know the positions of these extractions, as related values often have the same parent, or are otherwise close siblings.

The approach in this work aims for element extraction.

2.3 Data Sets

Many papers on information extraction from web pages evaluate their algorithm on some set of well-known web sites. Web sites, however, are not static. The layout (and the generating scripts) are prone to change regularly. And because the exact data was never published, it is hard to reproduce these experiments, or to run new experiments to compare different algorithms.

There exists a public repository with fixed data sets taken from real-world web sites: the RISE repository. More and more authors use these data sets for evaluation. The RISE repository is available at the following URL:

http://www.isi.edu/info-agents/RISE/index.html.

For our experiments we have chosen the WIEN data sets (Kushmerick et al. 1997), the Bigbook data set, and the Okra data set, all available in the RISE repository.

Each of these sets contains a set of pages from a given domain, most of them having an associated set of annotations for a specific extraction task. The WIEN data sets each have 10 pages, apart from some exceptions that have fewer. The Bigbook and Okra data sets have 235 and 252 pages, respectively. These data sets have annotations for n-tuples. Because we focus on single field extraction, we have split each n-tuple extraction task into n single field extraction tasks. We refer to these tasks with the name of the original data set and the index of the field in the tuple. Each example is a single page with exactly one of the target values marked. The target concepts are given only through annotated pages; no rules are given for a correct wrapper. Hence we will assume that the annotated pages contain all possible exceptions, and that a wrapper is correct for the given domain when it correctly extracts every annotated page in the data set.

We mainly use the data sets that are effectively annotated in the repository, and some of the extra annotations used in (Muslea et al. 2001). Furthermore, we left out some data sets that were hard to represent in the STALKER embedded catalog formalism (Muslea et al. 2001). In Section 7.3, we use only those extraction tasks that extract a complete text node. In Section 8.4, we use both node and sub-node extraction tasks. Some fields are contained in the 'href' attribute of an 'a' tag, or the 'src' attribute of an 'img' tag. In the tree-based approach, the HTML parser associates the attributes with the corresponding node, and a trivial step can be added to retrieve these values. We decided to leave these tasks out, as they are skewed in favor of the tree-based approach.

2.4 Evaluation Metrics

To measure the quality of a wrapper for a given extraction task, we assume the availability of an evaluation set. This is a set of completely annotated pages from the same domain as the extraction task, which are assumed to be representative for that domain. Let T denote the total number of target elements for the extraction task in the evaluation set.

When we apply the wrapper to the evaluation set, we denote by E the number of extractions made by the wrapper from the evaluation set. Those extracted elements that are target elements are called true positives; we denote their number by TP. Those elements that are extracted but are in fact no target element are called false positives; we denote their number by FP. The target elements from the evaluation set that were not extracted by the wrapper are called false negatives; their number is denoted by FN.

The precision (P) of the wrapper on a given evaluation set is defined as the percentage of the elements extracted by the wrapper that are extracted correctly: P = TP/E = TP/(TP+FP). The recall (R) of the wrapper on a given evaluation set is defined as the percentage of the target elements in the evaluation set that are extracted by the wrapper: R = TP/T = TP/(TP+FN).

It is possible for a wrapper to have a good precision and a bad recall, and vice versa. For example, consider a document containing 100 nodes, 10 of which are target nodes for some extraction task (T=10). A wrapper could extract a single node (E=1) which is a target node (TP=1). This wrapper has a precision of 100%, but a recall of 10%. Another wrapper could extract all 100 nodes from the document (E=100), and therefore extract all the target nodes (TP=10). This second wrapper has a precision of 10%, and a recall of 100%. To measure whether a wrapper has both reasonable precision and recall, the F1 score is often used as a fitness criterion. The F1 score is defined as the harmonic mean of precision and recall: F1 = 2PR/(P+R).
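The metrics above can be collected in a small helper. The following Python sketch is illustrative only (the function name and signature are ours, not from this work); it recomputes the two wrappers of the example.

```python
# Hypothetical helper for the metrics defined above (P, R and F1).
def metrics(tp, fp, fn):
    """Return (precision, recall, F1) from the TP, FP and FN counts."""
    e = tp + fp              # E: number of extractions made by the wrapper
    t = tp + fn              # T: total number of target elements
    p = tp / e               # P = TP/E
    r = tp / t               # R = TP/T
    f1 = 2 * p * r / (p + r)
    return p, r, f1

# The two wrappers from the example: 10 targets in a 100-node document.
print(metrics(tp=1, fp=0, fn=9))     # first wrapper:  P=1.0, R=0.1
print(metrics(tp=10, fp=90, fn=0))   # second wrapper: P=0.1, R=1.0
```

Both wrappers obtain the same (low) F1 score, which is what makes F1 a useful single-number fitness criterion.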


Chapter 3

Automata

In this chapter we introduce strings, trees, and finite state automata that operate on them. Several properties of both string automata and tree automata are discussed. We present general frameworks for the specification and implementation of operations on string and tree automata, and illustrate them with operations such as negation, union, and intersection. The existence and construction of equivalent minimal and equivalent deterministic automata is discussed.

3.1 Alphabets, Strings and Trees

A symbol is an abstract entity. Common symbols are characters, digits, glyphs or other identifiers. An alphabet is a finite set of symbols. Examples are the set of digits, the set of Roman (or Arabic) characters, the hexadecimal alphabet {0, . . . , 9, A, . . . , F}, or the set of squares on a chess board {A1, A2, . . . , H8}. A string over an alphabet is defined as a sequence of symbols from the alphabet. The empty string or sequence is denoted by ε. More formally, we define the set of all possible strings over an alphabet Σ as the smallest set satisfying:

Definition 3.1 (Strings over Σ) Σ∗ = {ε} ∪ {es | e ∈ Σ, s ∈ Σ∗}.

Given the alphabet B = {0, 1}, we see that B∗ contains, amongst others, the sequences 101010ε, 1101ε and 1010011010ε. As each sequence ends by definition with ε, it is omitted from all sequences except the empty sequence, resulting in 101010, 1101 and 1010011010. The length of a string is recursively defined as:

Definition 3.2 (Length of a string) Given s ∈ Σ∗ and e ∈ Σ:
length(ε) = 0
length(es) = length(s) + 1

The set of strings over an alphabet with a given length is then defined as:


Definition 3.3 (Strings of length i) Σ^i = {s | s ∈ Σ∗ & length(s) = i}.

The set of strings of length 2 over the alphabet {A,B}, for example, is the set {A,B}^2 = {AA, AB, BA, BB}. Given the notion of length of a string, we define the positions in a string.

Definition 3.4 (Positions in a string) Given s ∈ Σ∗, the set of positions in s is defined as P(s) = {0, . . . , length(s) − 1}.

Each position in a string refers unequivocally to an element of that string. An element is defined not only by its value (a symbol), but also by its suffix (the ensuing string) and its prefix (the string formed from the preceding symbols). The value of an element at a certain position in a string is defined as follows:

Definition 3.5 (Value of an element of a string) Given p ∈ P(s) and s = es′, with s′ ∈ Σ∗ and e ∈ Σ:

s ↓ p = es′ ↓ p = { e if p = 0; s′ ↓ (p − 1) if p > 0 }.

Example 3.1 Let Σ = {a, b, c}. The set of strings defined as Σ∗ contains the string 'cba', for which length('cba') = 3. The set of positions in this string is P('cba') = {0, 1, 2}, and 'cba' ↓ 0 = c and 'cba' ↓ 1 = b.
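Definitions 3.2 and 3.5 translate directly into recursive code. The following Python sketch (ours, not from this work) models a string over an alphabet as a Python str and reproduces Example 3.1:

```python
# Recursive length and position lookup, following Definitions 3.2 and 3.5.
def length(s):
    # length(ε) = 0; length(es) = length(s) + 1
    return 0 if s == "" else 1 + length(s[1:])

def value_at(s, p):
    # s ↓ p: split s = es', return e if p = 0, else recurse on s' with p - 1.
    e, rest = s[0], s[1:]
    return e if p == 0 else value_at(rest, p - 1)

# Example 3.1: length('cba') = 3, 'cba' ↓ 0 = c, 'cba' ↓ 1 = b
print(length("cba"), value_at("cba", 0), value_at("cba", 1))
```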

The concatenation of two strings results in a new string that is made up of these two strings placed after each other:

Definition 3.6 (Concatenation of strings) Given the strings s, w ∈ Σ∗, and a symbol a ∈ Σ:

sw = { w if s = ε; aw if s = a; a(s′w) if s = as′ }.

We denote the set of suffixes of strings from a given language L, starting with w, as the left quotient of L with w:

Definition 3.7 (Left quotient) The left quotient w\L of a set of strings L ⊆ Σ∗, with a string w ∈ Σ∗, is:

w \ L = {v ∈ Σ∗ | wv ∈ L}.

Below we define T (Σ) as the set of trees over an alphabet Σ. This is an inductive definition, given that T (Σ)∗ denotes the set of all sequences over the set of trees over Σ, where in the base case s equals the empty sequence.
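For a finite language, the left quotient is easy to compute. The sketch below (ours, not from this work) represents a language as a set of Python strings:

```python
# Left quotient of a finite language: w \ L = { v | wv ∈ L }.
def left_quotient(w, language):
    return {s[len(w):] for s in language if s.startswith(w)}

L = {"ab", "abc", "ba", "abba"}
print(left_quotient("ab", L))   # suffixes of the strings that start with "ab"
```

Here 'ab' itself contributes the empty suffix, 'abc' contributes 'c', and 'abba' contributes 'ba'.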

Definition 3.8 (Trees over Σ) T (Σ) = {f(s) | f ∈ Σ, s ∈ T (Σ)∗}.

We usually denote f(ε), where ε is the empty sequence, by f. Note that T (Σ) does not contain ε. In contrast with strings, an 'empty tree' is not defined.



Figure 3.1: Graphical representation of the trees a(ba(bc)) and b(a(c)bc)

Example 3.2 Let Σ = {a, b, c}. The trees a(b()a(b()c())) and b(a(c())b()c()) are both elements of T (Σ). To enhance readability, the empty parentheses are not written: a(ba(bc)) and b(a(c)bc). Fig. 3.1 shows how these two trees can be represented graphically.
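A tree f(t1 . . . tn) can be modelled as a nested pair. The Python sketch below (ours, not from this work) builds the two trees of Example 3.2 and prints them back in the abbreviated notation:

```python
# An unranked tree f(t1 ... tn) as a (label, children) pair.
def tree(label, *children):
    return (label, children)

# The trees of Example 3.2: a(ba(bc)) and b(a(c)bc)
t1 = tree("a", tree("b"), tree("a", tree("b"), tree("c")))
t2 = tree("b", tree("a", tree("c")), tree("b"), tree("c"))

def show(t):
    # Print a tree, omitting the empty parentheses of leaves.
    label, children = t
    return label + ("(" + "".join(show(c) for c in children) + ")" if children else "")

print(show(t1), show(t2))  # a(ba(bc)) b(a(c)bc)
```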

A tree can be seen as a type of directed graph. Each node (element of the tree) has a symbol associated with it, and a number of outgoing edges, leading to its children. Nodes without any children are called leaves. Each child is a tree itself, a subtree of the original tree. Every node has exactly one incoming edge, except the initial one, which has none and is called the root. The node that is the source of the single incoming edge of a node is called that node's parent. An extra property is that the child nodes of a node are ordered. This order allows us to specify, for each node of a tree, the position of that node within the tree. For any tree t we define its positions P(t) as a finite subset of the set N∗ of finite sequences of positive integers, as follows:

Definition 3.9 (Positions in a tree) Given t ∈ T (Σ):
ε ∈ P(t)
p ∈ P(ti) ⇒ ip ∈ P(f(t1, . . . , tn)) for 1 ≤ i ≤ n.

Alternatively, we can refer to a node by its index, the position at which the node occurs in the description of the tree. This is the number the node gets if nodes are numbered according to a depth-first traversal of the tree. In the tree a1(b2a3(b4c5)), the root has, as always, index 1, while the leaves have indices 2, 4 and 5. The definition of positions in a tree allows us to define a subtree more rigorously. For p ∈ P(t), the subtree of t at position p is t/p, which is defined by induction as:

Definition 3.10 (Subtree) Given t ∈ T (Σ) and p ∈ P(t):
t/p = t if p = ε
f(t1, . . . , tn)/p = ti/p′ if p = ip′

The value of an element at a certain position in a tree is denoted t ↓ p, following the definition for strings. Note that t ↓ ε returns the label of the root.

Definition 3.11 (Value of an element of a tree) Given t = f(t1 . . . tn) ∈ T (Σ) and p ∈ P(t):

t ↓ p = f(t1 . . . tn) ↓ p = { f if p = ε; ti ↓ p′ if p = ip′ }.

Example 3.3 Given the tree t = a(b(de)c), we have P(t) = {ε, 1, 1.1, 1.2, 2}, the subterm t/1 = b(de), the value t ↓ 1 = b, and the value t ↓ 1.2 = e.
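Definitions 3.9 to 3.11 can be sketched on the (label, children) representation; this is our illustration, not code from this work. A position is a tuple of 1-based child indices, with the empty tuple standing for ε:

```python
# Positions, subtrees and values on (label, children) tuples (Def. 3.9-3.11).
def positions(t):
    label, children = t
    pos = {()}                                   # ε ∈ P(t)
    for i, child in enumerate(children, start=1):
        pos |= {(i,) + p for p in positions(child)}   # p ∈ P(ti) ⇒ ip ∈ P(t)
    return pos

def subtree(t, p):
    # t/p: t itself for ε, otherwise recurse into child i with the rest of p.
    return t if p == () else subtree(t[1][p[0] - 1], p[1:])

def value(t, p):
    # t ↓ p: the label of the subtree at position p.
    return subtree(t, p)[0]

# Example 3.3: t = a(b(de)c)
t = ("a", (("b", (("d", ()), ("e", ()))), ("c", ())))
print(sorted(positions(t)))              # ε, 1, 1.1, 1.2, 2 as index tuples
print(value(t, (1,)), value(t, (1, 2)))  # b e
```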

We define pointed trees, following (Bruggemann-Klein et al. 2001).

Definition 3.12 (Pointed tree) We define a pointed tree as a tree from T (Σ∪{X}) such that exactly one node is labeled by X, and that node is a leaf.

We denote the set of pointed trees over Σ as T (Σ, X). Given t′ ∈ T (Σ, X) and t ∈ T (Σ) ∪ T (Σ, X), the concatenation t′t of t′ and t is the tree obtained by replacing the node labeled X in t′ with t.

The set of trees defined in Definition 3.8 is often referred to as the set of unranked trees over Σ. This contrasts with the set of ranked trees, which is based on a ranked alphabet. An alphabet Σ is called a ranked alphabet when it has a relation rank ⊆ (Σ × N), called the rank, associated with it. While a node in an unranked tree can have an arbitrary number of children, the number of children of a node in a ranked tree depends on the rank of the symbol associated with that node. The set of ranked trees over a ranked alphabet is defined as:

Definition 3.13 (Ranked Trees)
Tranked(Σ) = {f(s) | (f, i) ∈ rank(Σ), s ∈ Tranked(Σ)^i}.

From this definition it is clear that the ranked trees are a subset of the trees defined in Definition 3.8, also called unranked trees. Later in this section we define tree automata. The automata over ranked trees are likewise a subset of the unranked tree automata. Unless explicitly indicated otherwise, all trees and tree automata in this work are unranked.

We also recall some notions about partitions. A partition P of a set X is a disjoint and complete set of subsets of X (P ⊂ 2^X). The element of a partition P that contains an element x of X is denoted as [x]P or simply as [x]. A partition P1 refines a partition P2 (P1 ⊑ P2) iff ∀A ∈ P1 : ∃B ∈ P2 : A ⊆ B. An equivalence relation ∼ over X defines equivalence classes on X. These classes form a partition of X, called the quotient set; it is denoted as X/∼.
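These notions are easy to make concrete. In the sketch below (ours, not from this work) a partition is a set of frozensets; `block` is the lookup [x]P and `refines` is the test P1 ⊑ P2:

```python
# Partitions as sets of frozensets.
def block(x, partition):
    # [x]_P: the block of the partition that contains x.
    return next(b for b in partition if x in b)

def refines(p1, p2):
    # P1 ⊑ P2 iff every block of P1 is contained in some block of P2.
    return all(any(a <= b for b in p2) for a in p1)

P1 = {frozenset({1}), frozenset({2}), frozenset({3, 4})}
P2 = {frozenset({1, 2}), frozenset({3, 4})}
print(block(3, P1))                      # frozenset({3, 4})
print(refines(P1, P2), refines(P2, P1))  # True False
```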

3.2 String Automata

Finite state string automata are a classic tool in computer science. We review the theory here as a basis for the extension of the finite state paradigm towards tree automata in the next section.


3.2.1 Definitions

In this section we introduce finite state automata and their use as functions from strings of symbols over an input alphabet to an output alphabet (Σ∗i → Σo). We start by introducing the notion of deterministic Finite State String Automata (FSA). A non-deterministic automaton is similarly defined, the only difference being that the next-state function is actually a next-state relation. As we will only need the deterministic version, we present neither non-deterministic string automata nor non-deterministic tree automata in this work. We do discuss determinization operations in Section 3.2.4 and Section 3.3.6.

Definition 3.14 (Finite State String Automata) A deterministic Finite State String (or Sequence) Automaton (FSA) is a tuple A = (Σi, Σo, Q, q0, δ, φ) where Σi is a set of input symbols, Σo is a set of output symbols, Q is a set of states, q0 ∈ Q is the initial state, δ : (Q × Σi) → Q is a transition function from a pair consisting of a (current) state and an input symbol to a (next) state, and φ : Q → Σo is an output function.¹

To use an FSA as a function from strings to symbols, we first define an extended transition function, which takes a string of symbols and an initial state instead of a single symbol and a state. The result of processing a string with an FSA is then the output associated with the state returned by this extended transition function when applied to that string and the initial state of the FSA. Note that this differs from a transducer: a transducer returns a string of output symbols (namely the concatenation of the outputs of the intermediate states visited to reach the final state), while an automaton returns a single output symbol.

Definition 3.15 (Extended Transition Function) Given an automaton A = (Σi, Σo, Q, q0, δ, φ), its extended transition function δ : (Q × Σ∗i) → Q is defined as δ(q, ε) = q and, ∀w ∈ Σ∗i and a ∈ Σi, δ(q, aw) = δ(δ(q, a), w).

To enhance readability, the transition function is often defined in a table, though a graphical representation is best for most small examples. In a graphical representation, states are represented by circles (nodes in a graph) that contain the output associated with that specific state. The initial state is indicated with a small arrow. The transition function is represented by labeled, directed edges. When the transition function leads from a pair (q0, s) to a next state q1, the graphical representation contains an edge starting at q0, labeled with s, and leading to q1. The identifier of each state can be placed as a label next to each node. These identifiers are omitted, though, unless we need to reference a specific state in a discussion of the FSA.

¹As the output of the automaton depends solely on the state, this definition conforms to the definition of a Moore automaton. In a Mealy automaton the output depends on both the state and the next input symbol.


δ a b c

q0 q1 q2 q0

q1 q3 q2 q0

q2 q1 q4 q0

q3 q3 q2 q0

q4 q1 q4 q0

Table 3.1: Transition Function of FSA A in Example 3.4

Example 3.4 Given an example automaton A = (Σi, Σo, Q, q0, δ, φ) with

Σi = {a, b, c}, Σo = {0, 1, 2}, Q = {q0, q1, q2, q3, q4},

δ =

δ(q0, a) → q1, δ(q0, b) → q2, δ(q0, c) → q0,
δ(q1, a) → q3, δ(q1, b) → q2, δ(q1, c) → q0,
δ(q2, a) → q1, δ(q2, b) → q4, δ(q2, c) → q0,
δ(q3, a) → q3, δ(q3, b) → q2, δ(q3, c) → q0,
δ(q4, a) → q1, δ(q4, b) → q4, δ(q4, c) → q0

, and

φ = {φ(q0) → 0, φ(q1) → 0, φ(q2) → 0, φ(q3) → 1, φ(q4) → 2}.

The transition function in this example is shown in Table 3.1. The graphical representation of A is shown in Fig. 3.2. To process the string 'acb' with the automaton A, we start in the initial state q0. Given the first input a, the state becomes q1. Processing the next symbol, c, leads back to the state q0, and with b we end up in the state q2. Therefore the result of processing the string 'acb' is 0, the output associated with q2. Other examples are 'caa', which results in 1, and 'aacbbb', which results in 2. The run on this last string is illustrated in Figure 3.3. After each step, the current state is shown, followed by the part of the string that remains to be processed. With a little thought we see that this automaton returns 1 for every string that ends with two or more a's, 2 for every string that ends with two or more b's, and 0 for all other strings.
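The runs described above are mechanical enough to reproduce in code. The following Python sketch (our illustration, not code from this work) encodes the transition table of Table 3.1 as a dict and implements the extended transition function of Definition 3.15 as a loop:

```python
# The FSA A of Example 3.4: δ as a dict, φ as a dict.
delta = {
    ("q0", "a"): "q1", ("q0", "b"): "q2", ("q0", "c"): "q0",
    ("q1", "a"): "q3", ("q1", "b"): "q2", ("q1", "c"): "q0",
    ("q2", "a"): "q1", ("q2", "b"): "q4", ("q2", "c"): "q0",
    ("q3", "a"): "q3", ("q3", "b"): "q2", ("q3", "c"): "q0",
    ("q4", "a"): "q1", ("q4", "b"): "q4", ("q4", "c"): "q0",
}
phi = {"q0": 0, "q1": 0, "q2": 0, "q3": 1, "q4": 2}

def run(string, state="q0"):
    # Extended transition function, then the output φ of the final state.
    for symbol in string:
        state = delta[(state, symbol)]
    return phi[state]

print(run("acb"), run("caa"), run("aacbbb"))  # 0 1 2
```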

As a special class of FSA, we distinguish the acceptors. These have as output set a set of two symbols, where one symbol indicates acceptance and the other rejection. States that have the accepting symbol as output are called accepting states, the others rejecting states. Often acceptors are defined without an output function, having instead a set of final states to identify the accepting states.

Definition 3.16 (Finite State String Acceptor) A deterministic Finite State String Acceptor is an FSA A = (Σi, Σo, Q, q0, δ, φ), with Σo = {accept, reject}.


Figure 3.2: Graphical representation of FSA A in Example 3.4

q0 aacbbb → q1 acbbb → q3 cbbb → q0 bbb → q2 bb → q4 b → q4 ε

Figure 3.3: An illustration of the run of the FSA from Figure 3.2 on the string‘aacbbb’.


An acceptor can be used to define a (possibly infinite) set of strings over an alphabet. Such a set contains exactly the strings that are accepted by the acceptor; in other words, those strings for which the state returned by applying the extended transition function of the acceptor on that string has the accepting symbol as output. It can be proven that the sets defined by all possible finite state string acceptors over a certain alphabet map one to one onto the regular languages defined over that alphabet, and onto the sets defined by the regular expressions over that alphabet.

Example 3.5 An acceptor that is equivalent to the regular expression 'ab*c' is presented in Fig. 3.4.b. Note that for acceptors we can use a simplified graphical representation. The rejecting states are represented by a dot, while the accepting states are represented by a dot surrounded by a circle.

We denote the regular language defined by an acceptor A, by L(A):

Definition 3.17 (Language of a string acceptor) Given a string acceptor A = (Σi, Σo, Q, q0, δ, φ), L(A) = {s ∈ Σ∗i | φ(δ(q0, s)) = accept}.

For automata that are not acceptors, the strings that result in a specific output also define a regular language. We can therefore generalize this definition as: La(A) = {s ∈ Σ∗i | φ(δ(q0, s)) = a}, with a ∈ Σo.

A dead state is a rejecting state that has no path leading to an accepting state. Once a run of an acceptor reaches a dead state, the final result will always be rejection, regardless of the remainder of the string.

Definition 3.18 (Dead State) A state q is a dead state ⇐⇒ ∀w ∈ Σ∗i : φ(δ(q, w)) = reject.

Until now we assumed that, given a state and an input, a transition to a next state is always defined. Automata with this property are called complete automata: a complete automaton has, in every state, a transition defined for every element of the input alphabet. Automata for which no transition is defined for some state-input pairs are called incomplete. In this text we assume that for every state-input pair for which a transition is missing, a transition to a dead state (or a state designated as dead state in the case of a non-acceptor) is implicitly defined. This assumption makes it possible to convert every incomplete automaton into a complete one. In Fig. 3.4.a and b, an incomplete and a complete version of the automaton from Example 3.5 are shown.
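The completion step can be sketched as follows. This is our illustration, not code from this work; the state names and the explicit dead-state name are hypothetical (the thesis figures use unnamed states):

```python
# Complete a partial transition table by routing every missing
# (state, symbol) pair to an explicit dead state.
def complete(delta, states, alphabet, dead="q_dead"):
    full = dict(delta)
    for q in list(states) + [dead]:
        for a in alphabet:
            full.setdefault((q, a), dead)   # missing transition -> dead state
    return full

# An incomplete acceptor for 'ab*c' over {a, b, c}:
# q0 -a-> q1, q1 -b-> q1, q1 -c-> q2 (accepting).
delta = {("q0", "a"): "q1", ("q1", "b"): "q1", ("q1", "c"): "q2"}
full = complete(delta, ["q0", "q1", "q2"], "abc")
print(full[("q0", "b")], full[("q2", "a")])  # q_dead q_dead
```

Note that the alphabet must be passed in explicitly, for exactly the reason discussed below for the negation operator: it cannot always be derived from an incomplete automaton.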

The graph representing an automaton can be divided in two: the set of states reachable from the initial state, and the set of states that cannot be reached from the initial state. The latter are called the unreachable states. As the graph is a directed graph, the unreachable states are not necessarily part of disconnected subgraphs. It is possible that reachable states can be reached from unreachable ones, but not the other way round.

Figure 3.4: Incomplete (a) and complete (b) FSA's, with unreachable states (c)

When the automaton is used to process strings, it is clear that an automaton from which the unreachable states are removed is still equivalent to the original one. In Fig. 3.4.c an automaton with unreachable states, conforming to the definition in Example 3.5, is shown.

As is illustrated in Fig. 3.4, the same regular language can be accepted by multiple different automata. Such automata are equivalent.

Definition 3.19 (Equivalent Automata) Two automata A1 and A2 are equivalent: A1 ≡ A2 ⇔ ∀s ∈ Σ∗i : φ1(δ1(q01, s)) = φ2(δ2(q02, s)), where q01 and q02 are the initial states of A1 and A2.

This definition depends solely on the reachable states. The unreachable states cannot influence the equivalence relation; they can be removed or added arbitrarily. Therefore we will always use automata with only reachable states. We also define an equivalence relation on states. Equivalent states are states that are indistinguishable with respect to their output for different strings.

Definition 3.20 (Equivalent states) Given an automaton A, two states p and q are equivalent: p ≡A q ⇔ ∀s ∈ Σ∗i : φ(δ(p, s)) = φ(δ(q, s)).

When the states p and q are reachable states, it is by definition possible to find strings s1 and s2 such that δ(q0, s1) = p and δ(q0, s2) = q. Therefore Definition 3.20 can be rewritten as:

Definition 3.21 (Equivalent Reachable States) p ≡A q ⇔ ∀s, s1, s2 ∈ Σ∗i : δ(q0, s1) = p and δ(q0, s2) = q ⇒ φ(δ(q0, s1s)) = φ(δ(q0, s2s)).

3.2.2 String Automata Operations

In this section we discuss the implementation of standard operations on string automata within a general framework, which will allow us to easily define new operations in subsequent sections and chapters. We consider operations that take one or more existing string automata and result in a new automaton. For many operations this result can be described as a composite automaton Ac = (Σi, Σo, Qc, q0c, δc, φc), whose states (∈ Qc) are represented by composites containing states of the original automata. Let Op be an operation that takes a list of operands and constructs a new automaton, i.e. Ac = Op(list). This operation is completely defined given the following three functions (with c ∈ Qc and a ∈ Σi):

q0c = getInitialComposite(list)
δc(c, a) = getCompositeTransition(c, a, list)
φc(c) = getCompositeOutput(c, list)

Hence, given these functions, we can execute the resulting automaton without constructing it. In the remainder of this work we will use these functions to define new operations. Below, we illustrate the use of these definitions for well-known string automata operations like union, intersection, and concatenation. But we first introduce a general algorithm to construct the result of an operation given these three defining functions.

3.2.2.1 A General Construction Algorithm

When we construct the result of an operation, there exists a one-to-one correspondence between the states of the new automaton (Q) and the states of the composite representation of that result (Qc). We define a mapping 'map' to hold this relation. The function map.add(q, c), with q ∈ Q and c ∈ Qc, adds the pair (q, c) to the mapping, such that map.getState(c) = q.

To start, the algorithm (Algorithm 3.1) creates a new initial state and associates this state with the initial composite state that is returned by the function getInitialComposite. From the initial state, the algorithm traverses all possible transitions to reach all reachable states. The linked composite representation is used to calculate the transition from a composite state, given an input symbol, to another composite state. If there does not yet exist a state in the automaton under construction that is associated with this latter composite state, a new state is created and added to the automaton. The new state - composite state pair is then added to the mapping. The input alphabet of the new automaton (and of the composite representation) is the union of the input alphabets of the different automata passed to the operator.

3.2.2.2 Copy and Negation

The simplest operation is to copy an existing FSA. The composite states are represented by the states of the original automaton itself. The getInitialComposite function returns the initial state of the single automaton in the operand list. The function getCompositeTransition returns the result of the original transition function for the given original state and input symbol. The getCompositeOutput function returns the output of the given original state. Note that this operation


Algorithm 3.1 General Algorithm for String Automata Construction
Input: A list of operands OpList and the functions defining the operation.
Output: The resulting automaton.
1: map = empty mapping
2: initial = getInitialComposite(list)
3: newState = new state
4: map.add(newState, initial)
5: agenda.push(initial)
6: while agenda not empty do
7:   state = agenda.pop()
8:   for all symbols from the input alphabets of the original FSA's do
9:     next = getCompositeTransition(state, symbol, list)
10:    if next != nil then
11:      if map.getState(next) = undefined then
12:        newState = new state
13:        map.add(newState, next)
14:        setOutput(newState, getCompositeOutput(next, list))
15:        agenda.push(next)
16:      end if
17:      addTransition(map.getState(state), symbol, map.getState(next))
18:    end if
19:   end for
20: end while
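Algorithm 3.1 can be rendered as a short Python function. This is our sketch, not the thesis' implementation; the three defining functions are passed in as callables, composites are used directly as dict keys, and the mapping 'map' becomes a dict from composites to integer state ids:

```python
from collections import deque

def construct(get_initial, get_transition, get_output, alphabet):
    """Build (initial state, transitions, outputs) for the composite automaton."""
    initial = get_initial()
    state_of = {initial: 0}                  # the mapping 'map'
    outputs = {0: get_output(initial)}
    transitions = {}
    agenda = deque([initial])
    while agenda:
        composite = agenda.popleft()
        for symbol in alphabet:
            nxt = get_transition(composite, symbol)
            if nxt is None:                  # 'nil': no transition defined
                continue
            if nxt not in state_of:          # first visit of this composite
                state_of[nxt] = len(state_of)
                outputs[state_of[nxt]] = get_output(nxt)
                agenda.append(nxt)
            transitions[(state_of[composite], symbol)] = state_of[nxt]
    return 0, transitions, outputs

# Usage: the 'copy' operation, where composites are simply the original
# states. The operand is an acceptor for (aa)+ with states 0, 1, 2.
delta = {(0, "a"): 1, (1, "a"): 2, (2, "a"): 1}
accepting = {2}
q0, trans, out = construct(
    get_initial=lambda: 0,
    get_transition=lambda q, a: delta.get((q, a)),
    get_output=lambda q: q in accepting,
    alphabet="a",
)
print(trans, out)
```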


does not always yield an exact copy, but results in an equivalent automaton without any unreachable states. Implementing the negation operation is similar; only the getCompositeOutput function changes, such that it swaps the accept and reject symbols. For the negation operator, though (and many other operations), the original automaton should be complete to get a correct result. If the original automaton is incomplete, there is no need to do an explicit conversion to a complete version. It is possible to incorporate this conversion in the function getCompositeTransition: when the original transition function returns 'nil' (no transition) for a specific input, it is treated as the implicit dead state, and as such used in the composite state that is returned. Note that for incomplete automata, the input alphabet cannot always be derived from the automaton. Therefore the input alphabet should be given explicitly. This is illustrated in Example 3.6.

Example 3.6 Consider an automaton that accepts all strings consisting of the symbol 'a' only (Figure 3.5.a). When Σi = {a}, this automaton is complete, and the negation accepts only the empty sequence (Figure 3.5.b). When Σi = {a, b}, this automaton is incomplete (a complete version is shown in Figure 3.5.c). The negation then accepts the empty sequence and every string that contains at least one 'b' (Figure 3.5.d).

Figure 3.5: An example automaton (a) and its negation (b) for Σi = {a}; its complete version (c) and its negation (d) for Σi = {a, b}.

3.2.2.3 Union and Intersection

To check whether a string is an element of the union or intersection of the sets defined by two automata, we can run those automata in parallel and see whether one of the outputs of the resulting states is accepting (union) or whether both outputs are accepting (intersection). To create an actual automaton that checks the union/intersection directly, we simulate this parallel execution in the composite representation. A composite state is represented by a pair of states: the first element is the state within the first automaton, the second element is the state that the second automaton has reached during the parallel run. The result of the getInitialComposite function is the pair containing the initial states of the two original automata. The function getCompositeTransition returns a new pair of states, whose first element is the result of the transition function of the first automaton applied to the first element of the given pair and the given input symbol; the second element is found analogously with the transition function of the second automaton and the second element of the given pair. The getCompositeOutput function is the only difference between union and intersection: for the union operation it returns accept when at least one of the two states is an accepting state, while for the intersection it returns accept only when both states are accepting.
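The three defining functions for this product construction can be written down directly. The sketch below is our illustration, not code from this work; instead of constructing the result, it exploits the earlier observation that the composite automaton can be executed without being built. The two operands are the automata of Example 3.7, encoded with hypothetical integer state names:

```python
# The defining functions for the union/intersection of two acceptors.
def make_product(d1, q01, acc1, d2, q02, acc2, mode="union"):
    def get_initial():
        # Pair of the two initial states.
        return (q01, q02)
    def get_transition(pair, a):
        # Step both automata in parallel.
        return (d1[(pair[0], a)], d2[(pair[1], a)])
    def get_output(pair):
        in1, in2 = pair[0] in acc1, pair[1] in acc2
        return (in1 or in2) if mode == "union" else (in1 and in2)
    return get_initial, get_transition, get_output

# Example 3.7: (aa)+ and (aaa)* over the alphabet {a}.
d1 = {(0, "a"): 1, (1, "a"): 2, (2, "a"): 1}   # (aa)+,  accepting {2}
d2 = {(0, "a"): 1, (1, "a"): 2, (2, "a"): 0}   # (aaa)*, accepting {0}
init, step, out = make_product(d1, 0, {2}, d2, 0, {0})

def accepts(s):
    # Execute the composite automaton without constructing it.
    pair = init()
    for a in s:
        pair = step(pair, a)
    return out(pair)

print(accepts("aa"), accepts("aaa"), accepts("a"))  # True True False
```

Passing the same functions to a construction routine in the style of Algorithm 3.1 would materialize the union automaton itself.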

Figure 3.6: Two example automata (a), their union (b) and its composite representation (c)

Example 3.7 As an example we illustrate the construction of the union of two small automata. The first automaton accepts all strings consisting of a multiple of two a's, not including the empty string (i.e., the strings defined by the regular expression (aa)+). The second automaton accepts all strings consisting of a multiple of three a's (the strings defined by the regular expression (aaa)∗). These automata are shown in Fig. 3.6.a and their union in Fig. 3.6.b. An initial state for the automaton under construction is created, and the pair (0,0), returned as the initial composite state, is pushed onto the agenda. In the loop, the composite state (0,0) is popped from the agenda, and getCompositeTransition results in the composite state (1,1) for symbol 'a'. As this is the first time this composite state is encountered, a new state for the automaton is created, and the composite state is pushed onto the agenda. This continues until the composite state (2,0) is processed. The transition from (2,0) with symbol 'a' results in the composite state (1,1). This pair already has an existing state linked to it, thus map.getState is defined, and no new state is pushed onto the agenda. The agenda is now empty, and the algorithm finishes.

In this example, the agenda always contained exactly one element, hence the order of processing the states was fixed. When states have multiple transitions, the agenda will sometimes contain more than one entry. The order of processing then depends on the implementation of the agenda, but is irrelevant for the algorithm.

3.2.2.4 Concatenation

A string is accepted by the concatenation of two automata when that string can be split into two substrings, such that the first substring is accepted by the first automaton and the second substring is accepted by the second automaton. It is possible that, among all ways to split the string in two, there are multiple splits for which the first substring is accepted by the first automaton. To check whether a string is accepted by the concatenation of the two automata, it is therefore not sufficient to check only the first split for which the initial part is accepted by the first automaton. After every initial substring accepted by the first automaton, the remainder of the string should be processed by the second automaton. If these different runs proceed in parallel, the second automaton will end up in different states depending on the split. If one or more of these states are accepting states, at least one split can be found for which the two substrings are accepted by the respective automata.

For the composite representation of the concatenation, we use the combination of a single state from the first automaton and a set of states from the second automaton. This set represents the states reached after running the second automaton from the different starting points (marked by acceptance by the first automaton). The function getInitialComposite returns the pair consisting of the initial state of the first automaton and either an empty set, when the former is a rejecting state, or a singleton containing the initial state of the second automaton, otherwise. The result of getCompositeTransition is a new composite state, with as single state the transition defined by the first automaton for the given single state and the given input symbol. The set in the new composite state contains the states resulting from the transitions defined by the second automaton for the different states from the set in the given composite state and the given symbol. When a transition defined by the second automaton results in a dead state, though, it is not added to the new set. Additionally, when the new single state is an accepting state, the initial state of the second automaton is also added to the new set. The getCompositeOutput function returns accept for a given composite state only when at least one of the states from the set is an accepting state. The single state is ignored by the output function.
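The composite construction for concatenation can be sketched as below. This is a hedged illustration under assumed encodings: automata are dicts with complete transition tables, the two small test automata are invented for the sketch (the first accepts a+, the second the single string 'b'), and the dead-state pruning mentioned above is skipped — explicit dead states are simply carried along, which does not affect acceptance.

```python
def concat_initial(A1, A2):
    # Composite state: (state of A1, set of A2-states). The set starts with
    # A2's initial state only when A1's initial state is already accepting.
    p = A1['init']
    s = {A2['init']} if p in A1['accept'] else set()
    return (p, frozenset(s))

def concat_transition(comp, a, A1, A2):
    p, s = comp
    p2 = A1['delta'][(p, a)]
    s2 = {A2['delta'][(q, a)] for q in s}
    if p2 in A1['accept']:          # a new split point: restart A2 here
        s2.add(A2['init'])
    return (p2, frozenset(s2))

def concat_accepts(s, A1, A2):
    comp = concat_initial(A1, A2)
    for ch in s:
        comp = concat_transition(comp, ch, A1, A2)
    # getCompositeOutput: accept iff some state in the set is accepting;
    # the single A1-state is ignored.
    return any(q in A2['accept'] for q in comp[1])
```

The concatenation of the two test automata then accepts exactly the strings matching a+b.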

Example 3.8 In Fig. 3.7 the concatenation of two example automata is shown. This new automaton accepts strings with a length larger than four, which consist either of an odd number of 'a's followed by a single 'b', or an even number of 'a's followed by a single 'c'. Thus the string 'aaaaaac' is accepted. During the run on this string, we reach the composite state (2,{0,2}) after processing the substring 'aaaa'. The different elements of this composite state reflect the resulting states



Figure 3.7: Two example automata (a) and the composite representation of the concatenation of the bottom one to the top one (b). The automata in (a) are incomplete (the dead state is not shown explicitly). In (b) we use X to represent the dead state of the initial automaton.

depending on the different choices that could be made while processing the previous symbols: either the first split was chosen ('aa'-'aaaac') and the second automaton is now in state 2, or the second split was chosen ('aaaa'-'aac') and the second automaton is now in state 0, or no split has been chosen yet and the run is in state 2 of the first automaton.

3.2.2.5 Iteration

The iteration of an automaton accepts the empty string and each string that is a concatenation of any number of strings accepted by that automaton. As this description indicates, this operation is closely related to concatenation. Each time an accepting state is reached, the automaton is concatenated to itself.

The composite state is now a single set of states. The getInitialComposite function returns the singleton containing the initial state of the original automaton. The result of getCompositeTransition is a new set containing the states resulting from the transitions applied to the states from the given set and the given symbol (dead states are again ignored). When the new set contains an accepting state, the initial state is also added to the new set. The getCompositeOutput function returns accept when the set contains at least one accepting state.
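A sketch of the iteration construction under the same assumed dict encoding. The test automaton (accepting {'ab', 'ac'}) is invented for the sketch; the empty string is handled explicitly, since the iteration accepts it by definition, and explicit dead states are carried along instead of being dropped.

```python
def iterate_initial(A):
    return frozenset([A['init']])

def iterate_transition(S, a, A):
    # Advance every tracked state; whenever an accepting state is reached,
    # a new iteration may start, so the initial state is added again.
    S2 = {A['delta'][(q, a)] for q in S}
    if S2 & A['accept']:
        S2.add(A['init'])
    return frozenset(S2)

def iterate_accepts(s, A):
    if s == '':   # the iteration accepts the empty string by definition
        return True
    S = iterate_initial(A)
    for ch in s:
        S = iterate_transition(S, ch, A)
    # accept iff the set contains at least one accepting state
    return bool(S & A['accept'])
```

The iteration of the test automaton accepts any concatenation of 'ab' and 'ac', such as 'abac'.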

Example 3.9 An example to illustrate the iteration of an automaton is given in Fig. 3.8. The original automaton accepts the strings 'ab', 'ac', and 'aba'. Its iteration accepts the empty string and every other string constructed from these basic strings, like 'acabab', 'abaababac', . . . . These strings can also be infinite, e.g. 'ababababab. . . '.



Figure 3.8: An example automaton (a) and the composite representation of its iteration (b).

3.2.3 Minimization of String Automata

Minimization of an automaton is defined as finding the equivalent automaton with a minimal number of states. The study of finite state automata for regular string languages and their minimization dates back to the 1950s (Moore 1956). The main motivation to obtain a minimal equivalent automaton is the implementation cost in electronics, or the constraints of small memory devices.

We discuss briefly the classic proof of the existence of a minimal equivalent string automaton, in order to relate to this proof when discussing its extension to trees. We also make an abstraction over different existing minimization algorithms.

3.2.3.1 Existence of a Minimal Set of String States

An equivalence relation between input strings can be defined for a given automaton. Two strings are equivalent when they are interchangeable as prefix in any input string. This equivalence relation is called right congruence and can be related to equivalence of reachable states.

Definition 3.22 (Right Congruence) Given a string automaton A, two strings s1 and s2 (s1, s2 ∈ Σ∗i ) are equivalent (s1 ∼A s2) if and only if ∀s ∈ Σ∗i : φ(δ(q0, s1s)) = φ(δ(q0, s2s)).

The relation between right congruence and the equivalence of states is easily proven from the respective definitions (Definition 3.22 and Definition 3.21).

Proposition 3.1 ∀s1, s2 ∈ Σ∗i : s1 ∼A s2 iff δ(q0, s1) ≡A δ(q0, s2)

For string automata, the existence of a minimal set of string states is proven as a corollary of the proof of the Myhill-Nerode theorem. The partition defined by the congruence relation Σ∗i / ∼A places all interchangeable strings in the same quotient set. It is shown that this partition defines a finite set of equivalence classes, and is isomorphic to the set of states of the minimal equivalent automaton.

For our purposes, it is useful to connect the left quotients of the language defined by the acceptor and the right congruence defined by that acceptor. We do this with the following proposition.


Proposition 3.2 ∀s1, s2 ∈ Σ∗i : s1 ∼A s2 iff s1 \ L(A) = s2 \ L(A)

Proof The sets s1 \ L(A) and s2 \ L(A) are equal if and only if ∀s ∈ Σ∗i : s ∈ s1 \ L(A) ⇒ s ∈ s2 \ L(A) and ∀s ∈ Σ∗i : s /∈ s1 \ L(A) ⇒ s /∈ s2 \ L(A). Given Definition 3.7, we can rewrite this as ∀s ∈ Σ∗i : s1s ∈ L(A) ⇒ s2s ∈ L(A) and ∀s ∈ Σ∗i : s1s /∈ L(A) ⇒ s2s /∈ L(A). With Definition 3.17, this becomes ∀s ∈ Σ∗i : φ(δ(q0, s1s)) = accept ⇒ φ(δ(q0, s2s)) = accept and ∀s ∈ Σ∗i : φ(δ(q0, s1s)) = reject ⇒ φ(δ(q0, s2s)) = reject. Because the output alphabet of an acceptor contains only accept and reject, this collapses into ∀s ∈ Σ∗i : φ(δ(q0, s1s)) = φ(δ(q0, s2s)), which is, given Definition 3.22, equivalent to s1 ∼A s2. □

3.2.3.2 State Minimization

A taxonomy of minimization algorithms for finite string automata can be found in (Watson 1994), with algorithms ranging in complexity from O(n²) to O(n log n). Most algorithms (Moore 1956; Huffman 1964; Hopcroft 1971; Aho et al. 1986; Blum 1996) start from a partition with a single equivalence class, being the set of all states, and perform consecutive refinements until the equivalence partition is reached. These algorithms can be defined in terms of two basic refinement operators. Differences between algorithms are due to the order in which the basic operators are performed, their aggregation into larger operators, the stop criterion used, and the data structures used to select the next refinement.

A basic refinement operator aims at splitting a class of the partition into two classes. A first operator is applied when there is a class with output evidence: several states in the same class that output a different symbol. A second operator is applied when there is a class with transition evidence: several states in the same class that lead to states in different classes given some input symbol.

Definition 3.23 (Refinement Operator: Output Evidence) Given a partition of states P , containing a class [q] with output evidence. Ro(P, [q]), the partition returned by the operator, is defined as: Ro(P, [q]) = P \ {[q]} ∪ {{p ∈ [q] | φ(p) = φ(q)}, {p ∈ [q] | φ(p) ≠ φ(q)}}.

Definition 3.24 (Refinement Operator: Transition Evidence) Given a partition of states P , containing a class [q] with transition evidence for input symbol e. Rt(P, [q], e), the partition returned by the operator, is defined as: Rt(P, [q], e) = P \ {[q]} ∪ {{p ∈ [q] | [δ(p, e)] = [δ(q, e)]}, {p ∈ [q] | [δ(p, e)] ≠ [δ(q, e)]}}.

Note that both operators effectively split the class [q] into two classes when evidence for that class is present. Some algorithms combine several operations to split a class into multiple (more than two) classes. The Rt operator cannot create new output evidence. Therefore, all algorithms apply Ro operations in a first phase and Rt operations in a second phase of the algorithm. This first phase actually creates a partition of the states according to their output.
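The two-phase scheme (first Ro on output evidence, then repeated Rt on transition evidence) can be sketched as a Moore-style refinement loop. The dict encoding and the name minimize_partition are assumptions of this sketch, and grouping by the full transition signature splits a class into possibly more than two subclasses at once, as some algorithms do.

```python
def minimize_partition(states, alphabet, delta, output):
    # Phase 1 (Ro, output evidence): group states by their output symbol.
    blocks = {}
    for q in states:
        blocks.setdefault(output[q], set()).add(q)
    partition = list(blocks.values())
    # Phase 2 (Rt, transition evidence): split classes whose states lead
    # to different classes for some input symbol, until stable.
    changed = True
    while changed:
        changed = False
        index = {q: i for i, b in enumerate(partition) for q in b}
        new_partition = []
        for b in partition:
            groups = {}
            for q in b:
                sig = tuple(index[delta[(q, a)]] for a in sorted(alphabet))
                groups.setdefault(sig, set()).add(q)
            if len(groups) > 1:
                changed = True
            new_partition.extend(groups.values())
        partition = new_partition
    return {frozenset(b) for b in partition}
```

On a four-state "count a's modulo 4" acceptor that accepts at the even states, the refinement collapses the states into the two classes of a parity acceptor.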


3.2.3.3 Input Minimization

We define two input symbols as equivalent when they can be swapped in every string without changing the result of the automaton for that string.

Definition 3.25 (Equivalence of Input Symbols) Given a string automaton A, two input symbols x and y are equivalent (x ≡iA y) if and only if ∀s1, s2 ∈ Σ∗i : φ(δ(q0, s1xs2)) = φ(δ(q0, s1ys2)).

Input minimization consists of replacing the inputs by sets of equivalent inputs. This will not alter the number of states in the automaton, only the number of transitions. The minimization algorithm iteratively refines the set of all inputs until the equivalence partition is reached, i.e. when no more effective refinements can be found.

Definition 3.26 (Refinement Operator for Input Evidence) Given a partition of input symbols I that contains a class [e]I with input evidence for state q. Ri(I, [e]I , q), the partition returned by the operator, is defined as: Ri(I, [e]I , q) = I \ {[e]I} ∪ {{a ∈ [e]I | [δ(q, a)] = [δ(q, e)]}, {a ∈ [e]I | [δ(q, a)] ≠ [δ(q, e)]}}.

Applying Ri does not result in new output or transition evidence, hence input minimization can be performed as a third step after the state minimization.

Note that replacing input symbols by sets of input symbols can require an extra preprocessing step on the input strings. On the other hand, an application can also natively use sets of symbols as symbols. For example, the input alphabet of an automaton could consist of the set of strings over a given alphabet (hence the input is a string of strings). As this set is infinite, the input symbols have to be represented as sets of strings, possibly with string acceptors.

3.2.4 Determinization of String Automata

In a non-deterministic automaton, the transition function δ is no longer a function, but a transition relation. This relation can relate multiple next states to a single state-symbol pair. For example, in Figure 3.9.a, the transition from state 1 given the input symbol a leads to either state 2 or state 3, and the transition from state 2 given input symbol b results in either state 2 or state 4. We will represent the transition relation of a non-deterministic automaton as a function that returns a set of possible next states: (Q × Σi) → 2^Q.

A classic result in automata theory states that for every non-deterministic automaton, there exists an equivalent deterministic automaton. The traditional proof is a constructive one: an algorithm is provided that creates an equivalent deterministic automaton for a given non-deterministic automaton. We will give this algorithm within the operation framework from Section 3.2.2, i.e., we propose a composite representation and provide the functions needed by the general construction algorithm.


To run the non-deterministic automaton in a deterministic fashion, we will not choose a specific resulting state when a transition leads to multiple states, but we will follow each of the options in parallel. To simulate this behavior in the composite representation, we represent each composite state as the set of states reached in the parallel threads. The function getInitialComposite(list) returns the singleton containing the initial state of the original non-deterministic automaton. The function getCompositeOutput(c, list) returns accept when at least one of the states in c has accept as output. The function getCompositeTransition(c, a, list) returns a set containing the union of the transitions of each element of c, given the symbol a. This function is shown in Algorithm 3.2.

Algorithm 3.2 Function getCompositeTransition for the determinization operator
Input: A composite state c, an input symbol a, and a list (And) containing the original non-deterministic automaton.
Output: The next composite state.
1: next = ∅
2: for all state ∈ c do
3:   next = (next ∪ δnd(state, a)) \ {nil}
4: end for
5: return next
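The composite transition of Algorithm 3.2 and the surrounding worklist loop of the general construction can be sketched as follows. The dict-of-sets encoding of δnd and the function names are assumptions of this sketch; missing dict entries play the role of nil.

```python
from collections import deque

def nd_transition(c, a, delta_nd):
    # Algorithm 3.2: the next composite state is the union of the
    # non-deterministic transitions of every state in c; states with no
    # transition contribute the empty set.
    nxt = set()
    for state in c:
        nxt |= delta_nd.get((state, a), set())
    return frozenset(nxt)

def determinize(init, alphabet, delta_nd):
    # Worklist loop of the general construction algorithm: explore every
    # reachable composite state and record its transitions.
    start = frozenset([init])
    agenda, seen, delta = deque([start]), {start}, {}
    while agenda:
        c = agenda.popleft()
        for a in alphabet:
            nxt = nd_transition(c, a, delta_nd)
            delta[(c, a)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                agenda.append(nxt)
    return delta, seen
```

On the transitions quoted in Example 3.10, the sketch reproduces the composite steps of the text: {1} leads to {2, 3} on a, and {2, 3} leads to {2, 4} on b and to {3, 4} on a.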

Example 3.10 The composite representation resulting from the determinization of the non-deterministic automaton shown in Figure 3.9.a can be found in Figure 3.9.b. Starting from the composite state {1}, we see that the algorithm went to the composite state {2, 3} given symbol a, as the non-deterministic automaton leads to either state 2 or state 3 from state 1, given symbol a. To find the transition from the composite state {2, 3} given symbol b, we take the transition from state 2, being {2, 4}, and the transition from state 3, being {}, and we take their union. This results in {2, 4}, the next composite state. Given a, the states of {2, 3} lead respectively to {4} and {3}. Hence the resulting composite state is {3, 4}. In Figure 3.9.c the final result after minimization is shown.

3.3 Tree Automata

Whereas finite string automata process strings, finite tree automata operate on trees. We give the general definition of tree automata, propose a new representation for the transition function, and compare it and its minimization to existing representations. Operations on tree automata are again addressed in a general framework.



Figure 3.9: A non-deterministic automaton (a), the composite representation of its determinization (b), and the minimal equivalent deterministic automaton (c)

3.3.1 Definitions

Analogous to string automata, we use tree automata as functions from trees over an input alphabet to an output alphabet (T (Σi) → Σo). A string automaton calculates the state of a sequence by calculating the state of the prefix of the last symbol and then using the transition function to find the state resulting from the prefix state and the last symbol. Similarly, a bottom-up tree automaton calculates the state of a tree by applying the transition function associated with the root symbol on the sequence of states of the children. The evaluation of a tree recursively descends that tree to calculate first the states of the leaves and the lower trees, before evaluating the tree itself, hence the name bottom-up. The automaton is deterministic when the evaluation of a tree results in a unique state.

Definition 3.27 (Finite State Tree Automata) A deterministic bottom-up Finite state Tree Automaton (FTA) is a tuple T = (Σi, Σo, Q, δ, φ) where Σi is a set of input symbols, Σo is a set of output symbols, Q is a set of states, φ is an output function Q → Σo, and δ : (Σi × Q∗) → Q is a transition function from a pair consisting of an input symbol and a sequence of (child) states to a (next) state, such that for each a ∈ Σi and q ∈ Q, the set {w ∈ Q∗ | ((a, w) → q) ∈ δ} is a regular set of strings over the alphabet Q.

For tree automata we also extend the transition function to a function that returns the state for a given tree. This function applies the extended transition function δ recursively on the children of the root. The result of applying an automaton on a given tree t is defined as φ(δ(t)).

Definition 3.28 (Extended Transition Function) The extended transition function of an FTA, δ : T (Σi) → Q, is defined as δ(f(s)) = δ(f, map(δ, s)), with f(s) ∈ T (Σi). The function map(func, seq) returns the sequence obtained by applying the function func on each element of seq.
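Definition 3.28 translates directly into a recursive evaluator. In this sketch the transition function is a plain dict from (symbol, child-state tuple) pairs to states, which only works for a finite, partial enumeration of the needed transitions; Section 3.3.2 replaces this dict with a string automaton precisely because the full set of pairs is infinite in general. The Tree class and the name run_fta are assumptions of the sketch.

```python
class Tree:
    """Unranked tree: a label and a list of child trees."""
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

def run_fta(tree, delta):
    # Definition 3.28: evaluate the children first (map(delta, s)), then
    # apply the root's transition to the sequence of child states.
    child_states = tuple(run_fta(c, delta) for c in tree.children)
    return delta[(tree.label, child_states)]
```

With the partial δ of Example 3.11, the tree a(b(a), a) evaluates bottom-up to the accepting state 1.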

Definition 3.27 is equivalent to that in (Bruggemann-Klein et al. 2001) for bottom-up tree automata. However, they use a set of final states instead of an output


Figure 3.10: Processing a tree by a tree automaton.

function; so they only consider tree acceptors (see Definition 3.29). Our definition maps each tree on an element of Σo; this divides the trees into a language La(T ) for each a ∈ Σo, hence t ∈ La(T ) iff φ(δ(t)) = a. Their results carry over to our formalism. In particular, each language La(T ) is tree-regular. Below we give the definition of a finite state tree acceptor based on Definition 3.27. Unless explicitly mentioned otherwise, all tree automata in this work will be deterministic and bottom-up.

Definition 3.29 (Finite State Tree Acceptor) A deterministic Finite State Tree Acceptor is an FTA T = (Σi, Σo, Q, δ, φ), with Σo = {accept, reject}.

The accepted regular tree language is then:

Definition 3.30 (Language of a tree acceptor) Given a tree acceptor T =(Σi,Σo, Q, δ, φ), L(T ) = {t ∈ T (Σi) | φ(δ(t)) = accept}.

Example 3.11 Consider a tree acceptor that accepts all trees constructed with the symbols 'a' and 'b', with the symbol 'a' as root, in which the labels of subsequent siblings alternate between 'a' and 'b', such that the label of the first child differs from that of the parent. An example tree from this language is shown on the left of Figure 3.10.

This acceptor has {a, b} as input alphabet, {0, 1, 2} as set of states, and has as output function φ = {0 → reject, 1 → accept, 2 → reject}. From the transition function we only show the transitions needed to accept the example tree: δ = {(a, ε) → 1, (b, ε) → 2, (a, 21) → 1, (b, 1) → 2, . . .}.

In order to process the example tree, its subtrees have to be processed first. This is done recursively. In Figure 3.10 we show the different steps in the processing, by replacing every processed (sub)tree with the obtained state. The final state for the whole tree is 1, an accepting state, hence the tree belongs to the language.

Note that trees in this language can have an arbitrarily large number of children, implying that the total number of possible transitions in the transition function is infinite.

Dead tree states are defined similarly to dead string states. A state is dead if and only if every subtree ending in that state and all trees containing that subtree are rejected.


Definition 3.31 (Dead State) A state q is a dead state ⇐⇒ φ(q) = reject and ∀t ∈ T (Σi) : δ(t) = q, ∀t′ ∈ T (Σi, X) : φ(δ(t′t)) = reject.

Two automata are equivalent when they produce the same output for every possible input.

Definition 3.32 (Equivalent Automata) Two automata T1 and T2 are equivalent: T1 ≡ T2 ⇔ ∀t ∈ T (Σi) : φ1(δ1(t)) = φ2(δ2(t)).

For reachable tree states, i.e. states p such that there exists a tree t with δT (t) = p, we can define equivalence in a way similar to Definition 3.21, namely that two states are equivalent when they are interchangeable.

Definition 3.33 (Equivalence of Reachable Tree States) Given a tree automaton T , two reachable states δT (tp) and δT (tq) are equivalent (δT (tp) ≡T δT (tq)) iff ∀t ∈ T (Σi, X) : φT (δT (ttp)) = φT (δT (ttq)).

3.3.2 Representation of the Transition Function

Contrary to string automata and ranked tree automata (Comon et al. 1999), the number of different (a, w) pairs in the transition function of a tree automaton is infinite, hence δ cannot be defined by enumeration. The representation of the transition function is therefore not trivial.

We present a representation based on a single string automaton that has the set of tree states as both input alphabet and output alphabet. Hence this automaton will map a sequence of tree states onto a single tree state (this is possible because, according to Definition 3.27, these sequences belong to a regular set). We further extend the definition of an FSA such that it has multiple initial states instead of a single one, together with a function α that maps an input symbol a of the tree automaton onto an initial state α(a) of the FSA. Given that the string automaton AT = (QT , QT , QS , α, δS , φS) represents the transition function2 δT of the tree automaton T , the result of this function for a given pair (a, w) is calculated as δT (a, w) = φS(δS(α(a), w)). The tree automaton will be deterministic if and only if AT is deterministic.
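The calculation δT (a, w) = φS(δS(α(a), w)) can be sketched as follows. The dict encodings of α, δS and φS, the string-state names, and the (label, children) tuple encoding of trees are all assumptions of this sketch; the test data encodes the few transitions shown for Example 3.11.

```python
def tree_transition(a, w, alpha, delta_S, phi_S):
    # deltaT(a, w) = phiS(deltaS(alpha(a), w)): start in the initial string
    # state for root symbol a, run the FSA over the sequence of child tree
    # states w, and read the resulting tree state off the output function.
    q = alpha[a]
    for tree_state in w:
        q = delta_S[(q, tree_state)]
    return phi_S[q]

def eval_tree(tree, alpha, delta_S, phi_S):
    # Bottom-up evaluation; a tree is a (label, children) pair here.
    label, children = tree
    w = [eval_tree(c, alpha, delta_S, phi_S) for c in children]
    return tree_transition(label, w, alpha, delta_S, phi_S)
```

A small FSA encoding the transitions (a, ε) → 1, (b, ε) → 2, (b, 1) → 2 and (a, 21) → 1 maps the tree a(b(a), a) to tree state 1.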

Example 3.12 Consider the tree language from Example 3.11. Another example tree from this language is shown in Figure 3.11.a. A complete and an incomplete automaton that accept this language are shown graphically in Figure 3.11.b and c.

The string states of AT , the automaton that represents δT , are shown as nodes of a graph. The transitions between these states are represented by edges between these nodes. The tree state that triggers a particular transition is indicated as a label of the corresponding edge. The output (tree state) of a string state is shown

2 Transition functions and output functions of tree automata will be given a subscript T , those of string automata used to represent the transition function a subscript S.



Figure 3.11: a) An example tree from the language described in Example 3.12, b) a complete FTA and c) an incomplete FTA accepting that language.

inside the node. For finite tree acceptors, the output function is also represented graphically: the string states that output an accepting tree state get a double circle. For other output alphabets, the output function can be represented by a separate output table. The function α, mapping input symbols on an initial state, is visualized by placing the input symbol next to the associated initial state. Note that tree state '0' in Figure 3.11.b is a dead tree state, and the central string state in Figure 3.11.b is a dead string state of the representation. In Figure 3.11.c, these dead states are left out (implicit).

Example 3.13 In Figure 3.12.a another tree acceptor is shown. Figure 3.12.b shows the bottom-up run of this automaton on an example tree. The run starts at the leftmost leaf, having a tag 'b'. The initial state associated with this symbol returns the tree state '1'. One level higher, this state is used as the input for the transition function that starts from the initial state associated with the symbol 'c'. The subtree starting with 'c' returns the state '4'. Similarly, the leaves of its sibling are converted to the tree states '1' and '3', respectively. This sequence leads from the initial state associated with 'b' to a state that also returns a tree state '4'. Finally, the symbol 'a', together with the sequence of tree states (4,4), leads to tree state '5', which is an accepting state. Hence the tree is accepted.

3.3.3 Alternative Representations

An alternative representation for the transition function is given in (Kosala et al. 2003; Neven 2002). With each pair of an input symbol a and a possible resulting state qT , a string acceptor is associated that accepts every tree state sequence w for which δT (a, w) = qT holds. As the transition relation is assumed to be a function, the different acceptors associated with a given input symbol will be mutually exclusive. To evaluate the transition function for a given symbol and a sequence of tree states, each of the acceptors associated with the pairs



Figure 3.12: A finite tree acceptor (a), and an accepting run of this automaton on a tree (b).

containing that symbol has to be evaluated on the sequence. The result of the transition function is then the tree state from the pair with the sole acceptor that accepts the sequence. This is clearly less efficient than the representation proposed in Section 3.3.2. Also the constraints for a deterministic automaton are much more complex: each of the acceptors has to be deterministic, and the acceptors associated with the same input symbol have to be mutually exclusive. In Figure 3.13, an example of this alternative representation is shown graphically. The associated pair is shown before each string acceptor.

A second alternative representation for the transition function can be found in (Cristau et al. 2005; Raeymaekers and Bruynooghe 2004). In this representation, each input symbol a is associated with a separate FSA. This FSA is used as a function from Q∗T to QT . It returns for a sequence of states w the state qT for which δT (a, w) = qT (i.e., with qa the initial state of the FSA associated with input symbol a, φSa(δSa(qa, w)) = δT (a, w)). The tree automaton is deterministic when each string automaton is deterministic. Figure 3.14 shows an example of an FTA in this representation. The graphical representation uses the same principles as our representation, except that dotted lines are added to separate the different string automata.

This alternative representation can be derived from the previous one by calculating the union of the automata associated with the same input a, with their output functions modified such that the output symbol accept of the acceptor associated with the pair (a, qT ) is replaced by qT , and the output symbol reject by the dead tree state.

The disconnected FSAs in this representation of the transition function can also be interpreted as a single FSA (e.g. we could remove the dotted lines in Figure 3.14). This allows for a simple conversion to our representation, and it shows


Figure 3.13: An alternative representation for the transition function of the FTA from Figure 3.12, using a string acceptor for each input symbol - tree state pair. The output function φT of the FTA is shown in a separate table: φT : QT → Σo = {1 → reject, 2 → reject, 3 → reject, 4 → reject, 5 → accept}.

that this alternative representation can be seen as a subclass of our representa-tion. As the two classes are equally expressive, those FTA’s that are not in theintersection of the two classes, have an equivalent FTA in the two classes. Thishas implications for the equivalent automaton with a minimal number of states.Our representation allows for a more compact automaton (see Section 3.3.5).

Stepwise Tree Automata (STA) (Carme et al. 2004; Martens and Niehren 2006) are conceived as the canonical automaton notion for the algebra of unranked trees; this algebra is isomorphic to the term algebra using the curry encoding of unranked trees as binary trees.

In this formalism, the tree automaton and its transition function are interwoven. By interpreting the single set of states as string states of a transition function having an output function that returns the same state, but now interpreted as a tree state, STA's can be seen as yet another alternative representation for the transition function that is also based on a single FSA. In other words, the transition function is defined as an FSA AT = (QT , QT , QT , α, δS , φS), where a single set of states represents both the tree states and the states of the FSA.

Therefore this representation is also a subclass of our representation (with QS = QT ). In (Martens and Niehren 2006), it is proven that this restriction does not alter the expressiveness: an equivalent STA exists for every unranked tree automaton. Again there are implications for the equivalent automaton with a minimal number of states (see Section 3.3.5). The restriction on tree and string states adds an extra constraint on the design of operations, making them more complex (see Section 3.3.4).

An example of a stepwise tree automaton interpreted as an unranked tree automaton is shown in Figure 3.15. Note that every string state has a different output. This allows each string state to be designated by its own output, a 1 on


Figure 3.14: An alternative representation for the transition function of the FTA from Figure 3.12. Each input symbol has a finite string automaton assigned, to map sequences of tree states (from the children) to a resulting tree state.

1 mapping as ordered by the definition. As states 0, 4, 6, and 10 are never used in any transition, and are all rejecting, they are all dead tree states, and hence equivalent tree states (not equivalent string states). Also states 5, 7, and 9 are equivalent tree states, as they always occur on the same transitions. More on this in Section 3.3.5.

3.3.4 Tree Automata Operations

Similar to string automata operations (Section 3.2.2), the result of many tree automata operations can be represented as a composite automaton with composite tree states based on the original automaton or automata. The transition function of the result can then be represented as a composite string automaton with composite string states based on the string states of the original automaton or automata. This string automaton has the set of composite tree states as both input and output alphabet.

Let Op be an operation that takes a list of operands and constructs a new automaton, i.e. Tc = (Σi, Σo, QTc, δTc, φTc) = Op(list), where δTc is defined by ATc = (QTc, QTc, QSc, αc, δSc, φSc). The operation is then completely defined by the following functions (with cT ∈ QTc, cS ∈ QSc, and a ∈ Σi):

αc(a) = getInitialComposite(a, list)


3.3 Tree Automata 47

Figure 3.15: A stepwise tree automaton equivalent to the FTA from Figure 3.12, represented as an unranked tree automaton.

δSc(cS, cT) = getCompositeTransition(cS, cT, list)
φSc(cS) = getCompositeOutputS(cS, list)
φTc(cT) = getCompositeOutputT(cT, list)

The first three functions define the string automaton representing δTc. Notice that in contrast to regular string automata (Section 3.2.2), getInitialComposite has an extra parameter to allow for the multiple initial states used in our representation. A fourth function getCompositeOutputT is added to model the output function for the tree states. Below we introduce a general construction algorithm for tree automata based on these functions, analogous to the general construction algorithm for string automata, and we illustrate it by defining a number of tree automata operations.

3.3.4.1 A General Construction Algorithm

To recapitulate: a tree automaton in our representation is a string automaton taking tree states as input, and returning a tree state as output; the output is the state reached for the current subtree, while the inputs are the states reached by the children of that tree. The α function processes the labels of the tree nodes (the actual input alphabet of the tree automaton) and returns a specific start state, allowing for different results based on the label of the tree node (even when the children are identical). The general tree automata construction algorithm is therefore similar to the general string automata construction algorithm. It tries out every possible transition for every possible composite string state (starting from the initial state(s)), and eventually adds a transition to the corresponding state of the automaton under construction. This is done in the while loop of Algorithm 3.3. In each iteration, a state is taken from the top of the agenda. On Line 25, it is checked for each composite tree state whether it triggers a transition from that composite state (with tryInput). If so, a transition is added, and if the composite state is encountered for the first time, a new string state is created to be added to the final result.

The difference with the general string automata construction algorithm lies herein that when a composite string state is processed, its output is calculated, and if this output is a new composite tree state, a new tree state is created. Hence the input alphabet of the string automaton (the set of tree states) is augmented during the algorithm. It is therefore possible that for an already processed string state, new transitions can be found when new tree states are created. The algorithm therefore keeps track of a set of processed string states (done), and when a new tree state is encountered, each of these states is checked for a transition triggered by the new tree state (see Line 19).
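To make this evaluation view concrete, here is a minimal Python sketch (not the implementation used in this thesis; the dictionary encoding of the automaton and all names are illustrative). The transition function is a string automaton: alpha selects an initial string state per node label, delta_s consumes the children's tree states, and phi_s maps the final string state to the tree state of the subtree. Missing transitions (the implicit dead string state) are omitted in this sketch.

```python
def evaluate(tree, alpha, delta_s, phi_s):
    """tree is a pair (label, children); returns the tree state reached."""
    label, children = tree
    state = alpha[label]                    # start state chosen by the label
    for child in children:                  # feed the children's tree states
        state = delta_s[(state, evaluate(child, alpha, delta_s, phi_s))]
    return phi_s[state]                     # output = tree state of the subtree

def accepts(tree, alpha, delta_s, phi_s, phi_t):
    return phi_t[evaluate(tree, alpha, delta_s, phi_s)]

# A toy automaton accepting exactly the tree a(b b): 'b' leaves evaluate to
# tree state 'B', and an 'a' node with children B, B evaluates to 'A'.
ALPHA = {'a': 1, 'b': 4}
DELTA = {(1, 'B'): 2, (2, 'B'): 3}
PHI_S = {4: 'B', 3: 'A', 1: 'X', 2: 'X'}    # 'X' is a dead tree state
PHI_T = {'A': True, 'B': False, 'X': False}
```

Running `accepts(('a', [('b', []), ('b', [])]), ALPHA, DELTA, PHI_S, PHI_T)` simulates exactly the bottom-up run described above.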

For clarity we have added a suffix to those variables in Algorithm 3.3 that contain a state. This suffix is an S or a T depending on whether the state is a string or a tree state. For variables containing states from the composite representation the suffix is in capital letters, while for those containing states of the newly constructed automaton the suffix is in lowercase letters.

3.3.4.2 Copy and Negation

For the copy operation, the composite string and tree states are again the string and tree states of the original automaton. The final set of composite tree states is therefore known in advance, and a more efficient algorithm than Algorithm 3.3 can be used. In Example 3.14 though, this algorithm is used for copying an FTA, as this simple operation allows for a clear illustration of the algorithm. For the negation operation, the getCompositeOutputT function negates the output of the original tree states, and the other functions take care of possible dead string or tree states.
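For instance, the composite output function for negation can be sketched as follows (an illustrative Python fragment, not the thesis implementation; the handling of dead string and tree states mentioned above is omitted):

```python
# Negation: the composite states are the original states themselves; only the
# output function for tree states changes, flipping accept and reject.

def neg_output_t(c_t, phi_t_orig):
    # getCompositeOutputT for the negation operation (sketch)
    return not phi_t_orig[c_t]

phi_t_orig = {'A': True, 'B': False}
negated = {t: neg_output_t(t, phi_t_orig) for t in phi_t_orig}
```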

Example 3.14 In this example we illustrate Algorithm 3.3 by describing a partial run of the copy operator on a tree automaton. The tree automaton in this example accepts exactly one tree. A graphical representation of this tree and the automaton is given below. For this automaton, Σi = {a, b}, QT = {A, B, C, D, X}, and QS = {0, 1, 2, 3, 4, 5}. The string states are indicated as labels below the nodes that represent these string states. Note that state '0' is an implicit dead string state, and 'X' is a dead tree state.

(Figure: the single accepted tree over the labels 'a' and 'b', shown together with the automaton; Σi = {a, b}, QT = {A, B, C, D, X}, QS = {0, 1, 2, 3, 4, 5}.)

During initialization, the algorithm places the composite states representing the initial states on the agenda, and creates a copy for each of them. The initial states of the original automaton are 1 for symbol a, and 4 for symbol b. As the states in the composite representation for the copy operator are the same as for the


Algorithm 3.3 General Algorithm for Tree Automata Construction
Input: A list of operands OpList and the functions defining the operation.
Output: The resulting automaton.
1: map = empty mapping
2: done = ∅
3: treestates = ∅
4: for all symbol from input alphabets of the original FTA's do
5: initial S = getInitialComposite(symbol, list) // αc(symbol)
6: agenda.push(initial S)
7: newState s = new string state
8: map.add(newState s, initial S)
9: add (newState s, symbol) to the α function of the new automaton
10: end for
11: while agenda not empty do
12: current S = agenda.pop()
13: output T = getCompositeOutputS(current S)
14: if map.getState(output T) = undefined then
15: newState t = new tree state
16: map.add(newState t, output T)
17: setOutput(newState t, getCompositeOutputT(output T))
18: treestates = treestates ∪ output T
19: for all state S from done do
20: tryInput(state S, output T)
21: end for
22: end if
23: state s = map.getState(current S)
24: setOutput(state s, map.getState(output T))
25: for all input T from treestates do
26: tryInput(current S, input T)
27: end for
28: done = done ∪ current S
29: end while

Procedure: tryInput(input T, state S)
1: next S = getCompositeTransition(state S, input T, list)
2: if next S != nil then
3: next s = map.getState(next S)
4: if next s = undefined then
5: next s = new string state
6: map.add(next s, next S)
7: agenda.push(next S)
8: end if
9: addTransition(map.getState(state S), map.getState(input T), next s)
10: end if
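A hedged Python transcription of Algorithm 3.3 may clarify the agenda loop (illustrative only; get_initial, get_transition, get_out_s, and get_out_t stand for the getComposite... functions, automata are plain dictionaries, and states are numbered in order of discovery):

```python
def general_construct(symbols, get_initial, get_transition, get_out_s, get_out_t):
    smap, tmap = {}, {}                  # composite -> new string/tree state
    alpha, delta_s, phi_s, phi_t = {}, {}, {}, {}
    agenda, done, treestates = [], [], []

    def discover(comp_s):                # register a new composite string state
        smap[comp_s] = len(smap)
        agenda.append(comp_s)

    def try_input(comp_s, comp_t):       # procedure tryInput of Algorithm 3.3
        nxt = get_transition(comp_s, comp_t)
        if nxt is not None:
            if nxt not in smap:
                discover(nxt)
            delta_s[(smap[comp_s], tmap[comp_t])] = smap[nxt]

    for a in symbols:                    # lines 4-10: initial states
        init = get_initial(a)
        if init not in smap:
            discover(init)
        alpha[a] = smap[init]

    while agenda:                        # lines 11-29: agenda loop
        cur = agenda.pop()
        out = get_out_s(cur)
        if out not in tmap:              # a new composite tree state appears:
            tmap[out] = len(tmap)
            phi_t[tmap[out]] = get_out_t(out)
            treestates.append(out)
            for s in done:               # re-check processed string states
                try_input(s, out)
        phi_s[smap[cur]] = tmap[out]
        for t in treestates:
            try_input(cur, t)
        done.append(cur)
    return alpha, delta_s, phi_s, phi_t

# Copying a toy automaton accepting exactly the tree a(b b): for the copy
# operation the composite states are simply the original states.
A = {'a': 1, 'b': 4}
D = {(1, 'B'): 2, (2, 'B'): 3}
PS = {4: 'B', 3: 'A', 1: 'X', 2: 'X'}
PT = {'A': True, 'B': False, 'X': False}
copied = general_construct(['a', 'b'], A.get,
                           lambda s, t: D.get((s, t)), PS.get, PT.get)
```

The copy run reproduces the original automaton up to a renaming of the states, as in Example 3.14.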

Page 69:  · Promotoren : Prof. Dr. ir. M. BRUYNOOGHE Prof. Dr. J. VAN DEN BUSSCHE Januar y 2008 INFORMATION EXTRACTION FROM WEB PAGES BASED ON TREE AUTOMATA INDUCTION Proefschrift voorgedrage

50 Automata

original automaton, these states are pushed onto the agenda (as pictured below), while copies 1' and 4' are created as initial states for the new automaton (and added to its α function).

agenda = {1, 4}
done = {}
treestates = {}

In the first iteration over the while loop, state 1 (composite representation) is popped from the agenda. Its output A (getCompositeOutputS) is encountered for the first time, hence a copy A' is made, and associated as output to the state 1'. And A is added to the set of composite tree states. State 1 is checked for any transition given A, but none exists. Finally state 1 is added to the set of processed states (done).

In the second iteration, state 4 is popped from the agenda. Its output B is added to the set of composite tree states, and a copy B' is added to state 4'. As B is encountered for the first time, there is a check whether it triggers transitions from the processed state 1 (in 'done'), but none exists. In the second for loop, transitions starting from 4 are checked (for A and B in 'treestates'). A transition exists for A to state 5. As 5 is encountered for the first time, a copy 5' is added to the new automaton, and state 5 is pushed onto the agenda. Also a transition is added from 4' to 5', given A'. At the end of this loop, 4 is added to the set of processed states. The accessed part of the composite representation is shown below.

agenda = {5}
done = {1, 4}
treestates = {A, B}

The third iteration processes state 5, and encounters C for the first time. When checking the already processed states (1, 4) for a transition given the new input C, a transition from state 1 to the yet unseen state 2 is found. Hence 2 is pushed onto the agenda, a copy 2' is made, and the transition from 1' to 2', given C', is added to the new automaton. No transitions from 5 are found, and 5 is added to the processed states. This leaves the algorithm in the state shown below.

agenda = {2}
done = {1, 4, 5}
treestates = {A, B, C}

This process continues until the agenda is empty, and a copy of the original automaton is generated.


3.3.4.3 Union and Intersection

As for string automata, we can simulate the union or intersection of two tree automata by running them in parallel. At every point of the evaluation, the simulation keeps track of the current string state for each of the automata. But next to pairs of string states, there is, for every evaluated subtree, a pair of resulting tree states. This tree state pair will be the input to go to the next state pair in the simulated transition function.

More formally, we define a composite string state for the union/intersection of two automata T1 and T2 as a pair of string states (q1, q2), such that q1 ∈ QS1 and q2 ∈ QS2, where QS1 and QS2 are the sets of string states of the transition functions of respectively T1 and T2. A composite tree state is a pair of tree states (j1, j2), such that j1 ∈ QT1 and j2 ∈ QT2. The composite automaton (union) is then defined as:

αc(a) = (α1(a), α2(a))
δSc((q1, q2), (j1, j2)) = (δS1(q1, j1), δS2(q2, j2))
φSc((q1, q2)) = (φS1(q1), φS2(q2))
φTc((j1, j2)) = φT1(j1) ∨ φT2(j2).

For the composite automaton for the intersection, φTc is defined as:
φTc((j1, j2)) = φT1(j1) ∧ φT2(j2).
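Under the assumption that each automaton is given as mappings (alpha, delta_s, phi_s, phi_t), these composite functions can be sketched in Python (illustrative names, not the thesis implementation; a missing transition, i.e. the implicit dead state, shows up as a None component in a pair):

```python
def union_composites(A1, A2):
    a1, d1, ps1, pt1 = A1
    a2, d2, ps2, pt2 = A2

    def alpha_c(a):                      # pair of initial string states
        return (a1[a], a2[a])

    def delta_c(q, j):                   # run both transition functions
        return (d1.get((q[0], j[0])), d2.get((q[1], j[1])))

    def phi_sc(q):                       # pair of resulting tree states
        return (ps1.get(q[0]), ps2.get(q[1]))

    def phi_tc(j):                       # 'or' for union; 'and' for intersection
        return bool(pt1.get(j[0])) or bool(pt2.get(j[1]))

    return alpha_c, delta_c, phi_sc, phi_tc

# Two toy one-node automata: A1 accepts the leaf 'a', A2 rejects it.
A1 = ({'a': 0}, {}, {0: 'T'}, {'T': True})
A2 = ({'a': 0}, {}, {0: 'F'}, {'F': False})
alpha_u, delta_u, phi_su, phi_tu = union_composites(A1, A2)
```

The union automaton accepts the leaf 'a' because the first simulated run does; swapping `or` for `and` in phi_tc gives the intersection.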

Example 3.15 Figure 3.16.a shows a tree automaton that will accept every tree for which every node has label 'a', and for which every leaf has an even number of ancestors. Figure 3.16.b shows a tree automaton that will accept every tree for which every node has label 'a', and all nodes have an even number of children.

The composite automaton that represents the union of these two automata is shown in Figure 3.16.c.

3.3.5 Minimization of Tree Automata

A study (Martens and Niehren 2006) of the minimization of tree automata with different representations for their transition function shows that for some representations (the first alternative representation in Section 3.3.3), the minimal automaton exists but is not unique, while minimization is NP-complete. For other representations (the second alternative representation in Section 3.3.3, and STA), a unique minimal automaton exists and is computable in polynomial time.

In this section, we prove, independently of the representation of the transition function, that a minimal set of tree states for a tree automaton exists and that it is unique. We discuss equivalence properties of our representation of the transition function and we propose a (polynomial time) minimization algorithm for tree automata using that representation. We end with a comparison and experimental evaluation of minimization, given different transition function representations.



Figure 3.16: Two example automata (a) and (b), and the composite representation of their union (c).


3.3.5.1 Existence of a Minimal Set of Tree States

For FTAs, equivalence between input trees is called top congruence (Bruggemann-Klein et al. 2001). This is similar to right congruence for string automata.

Definition 3.34 (Top Congruence) Given a tree automaton T, two trees t1 and t2 (t1, t2 ∈ T(Σi)) are equivalent (t1 ∼T t2) iff ∀t ∈ T(Σi, X) : φT(δT(tt1)) = φT(δT(tt2)).

The following proposition relates top congruence between trees with equivalence of tree states.

Proposition 3.3 ∀t1, t2 ∈ T (Σi) : t1 ∼T t2 ⇔ δT (t1) ≡T δT (t2)

We can associate with each state the set of strings (or trees) that evaluate to that state. The elements in these sets are interchangeable: a prefix (subtree) can be replaced by another one in the same set without affecting the output of the automaton. Elements of the sets associated with equivalent states are also interchangeable. The partition defined by the top congruence relation, T(Σi)/∼T, effectively places all interchangeable trees in the same set.

For string automata it has been proven, as a corollary of the proof of the Myhill-Nerode theorem, that this partition is isomorphic to the set of states of the minimal equivalent automaton. We give a similar proof for tree automata.

Proposition 3.4 Given a tree automaton T. Automata equivalent to T exist that have a unique minimal set of tree states.
Proof Let [t] denote the class containing t in T(Σi)/∼T and La the tree language that is mapped on the output symbol a (φT(δT(t)) = a). Define Tmin as (Σi, Σo, Qmin, δmin, φmin) in which Qmin = T(Σi)/∼T and, for all f ∈ Σi and ti ∈ T(Σi), δmin(f, [t1] . . . [tn]) = [f(t1 . . . tn)] and φmin([t]) = a ⇔ t ∈ La. In (Bruggemann-Klein et al. 2001) it is proven (Theorem I, part of the extension of the Myhill-Nerode theorems to tree languages) that Tmin is well defined (Qmin is finite and δmin is definable by regular sets of strings) and recognizes a tree-regular language L (Tmin maps each t to the correct La).

Proposition 3.3 shows that T(Σi)/∼T and QT/≡T are isomorphic. Hence, for a given T, there exists a surjective homomorphism from QT to Qmin and every automaton has at least as many states as Tmin. In other words, Tmin is minimal and unique up to a renaming of the states. �

Hence the minimal set of tree states for a given tree automaton can be found by constructing the equivalence partition of its tree states. Note that the minimal number of tree states is independent of the representation of the transition function.

As an illustration, Fig. 3.17 shows that the minimal automata for our representation, and the two first alternative representations, all have 5 states (the dead state of the first alternative representation is not visible in the figure). The STA in Fig. 3.17.d cannot merge all tree states in the equivalence partition, due to the requirement that QT = QS. The number of tree states in this minimal STA is therefore larger than in the other representations.

3.3.5.2 Equivalence Properties of the Transition Function

In our representation, the transition function is a single FSA, having the same set of tree states as both input and output alphabet. The output of a string state can, by definition, be replaced by an equivalent tree state. Therefore we can reformulate Definition 3.20 as:

Definition 3.35 (Equivalence of states in a transition function) Given a string automaton AT, two states p and q are equivalent (p ≡AT q) iff ∀s ∈ QT* : φS(δS(p, s)) ≡T φS(δS(q, s)).

The proof that a unique minimal equivalent automaton for this specific FSA exists and that the set of its states is isomorphic to the equivalence partition of the states of the original FSA still holds. Similarly, Definition 3.25 can be reformulated.

Definition 3.36 (Equivalence of input symbols) Given the string automaton AT, two input symbols p and q, from QT, are equivalent (p ≡iAT q) iff ∀s1, s2 ∈ QT*, ∀a ∈ Σi : φS(δS(α(a), s1ps2)) ≡T φS(δS(α(a), s1qs2)).

Next, we show that two tree states are equivalent iff they are equivalent input symbols of the transition function and have the same output function.

Proposition 3.5 Given a tree automaton T with all states reachable, δT(tp) ≡T δT(tq) iff δT(tp) ≡iAT δT(tq) and φT(δT(tp)) = φT(δT(tq)).
Proof Given Definition 3.33, it suffices to show that
∀t ∈ T(Σi, X) : φT(δT(ttp)) = φT(δT(ttq)) (1)
iff
δT(tp) ≡iAT δT(tq) ∧ φT(δT(tp)) = φT(δT(tq)) (2)

The equality (1) holds iff it holds for both t = X and t ≠ X. For the first case (t = X), note that Xtp = tp and Xtq = tq. Hence we obtain: φT(δT(tp)) = φT(δT(tq)) (3)

For the second case (t ≠ X), t being a pointed tree different from X means it is of the form t′a(v1Xv2) with t′ a pointed tree. As we have to consider all possible t different from X, we have to perform universal quantification over t′, a, v1, and v2. Note that ttp and ttq are of the form t′a(v1Xv2)tp = t′a(v1tpv2) and t′a(v1Xv2)tq = t′a(v1tqv2) respectively. Hence we obtain for the second case: ∀v1, v2 ∈ T(Σi)*, ∀a ∈ Σi, ∀t′ ∈ T(Σi, X) : φT(δT(t′a(v1tpv2))) = φT(δT(t′a(v1tqv2))).


(Figure 3.17, panels a) to e); panel a) includes the output function φT : QT → Σo with 1 → reject, 2 → reject, 3 → reject, 4 → accept.)

Figure 3.17: Examples of (equivalent) FTAs, using different representations for their transition function. Automaton a) uses the first alternative representation, automata b) and e) both use the second alternative representation, automaton c) uses our representation, and automaton d) uses the STA representation. The automata a), b), c), and d) are the minimal equivalent automata for their respective representations. Note that b), d), and e) can also be seen as non-minimal examples of automata using our representation.


We can reformulate this as: ∀v1, v2 ∈ T(Σi)*, ∀a ∈ Σi, ∀t′ ∈ T(Σi, X) : φT(δT(t′t′p)) = φT(δT(t′t′q)) where t′p = a(v1tpv2) and t′q = a(v1tqv2).

Using the equivalence of Definition 3.33, we can rewrite this as: ∀v1, v2 ∈ T(Σi)*, ∀a ∈ Σi : δT(t′t′p) ≡T δT(t′t′q) where t′p = a(v1tpv2) and t′q = a(v1tqv2), or ∀v1, v2 ∈ T(Σi)*, ∀a ∈ Σi : δT(t′a(v1tpv2)) ≡T δT(t′a(v1tqv2)).

According to the definition of tree automata, it holds that δT(a(v1tpv2)) = φS(δS(α(a), map(δT, v1) δT(tp) map(δT, v2))) and similarly for δT(a(v1tqv2)). We get: ∀v1, v2 ∈ T(Σi)*, ∀a ∈ Σi : φS(δS(α(a), map(δT, v1) δT(tp) map(δT, v2))) ≡T φS(δS(α(a), map(δT, v1) δT(tq) map(δT, v2))).

Reachability of all states implies we can replace map(δT, v1) and map(δT, v2) with respectively s1 and s2, and quantify over all s1, s2 instead of over all v1, v2. This replacement results in ∀s1, s2 ∈ QT*, ∀a ∈ Σi : φS(δS(α(a), s1 δT(tp) s2)) ≡T φS(δS(α(a), s1 δT(tq) s2)).

Using Definition 3.36, the latter is equivalent to: δT(tp) ≡iAT δT(tq) (4).
The conjunction of (3) and (4) is equal to (2), hence we are done. �

3.3.5.3 Minimization of Tree Automata

Given a tree automaton, we consider the set of all equivalent tree automata. This set is partitioned into classes of automata having the same set of tree states. For each equivalence class, we can clearly find the equivalent tree automata with the minimal transition function. Because all automata in the same class have a fixed set of tree states, we can simply use FSA minimization on the transition function.

Proposition 3.4 states that there exists an equivalent automaton with a minimal set of tree states, QTmin. From the previous paragraph it follows that there exists a tree automaton with a unique, minimal transition function, out of all equivalent tree automata with QTmin. We now state that this transition function is minimal over all sets of tree states:

Proposition 3.6 The minimal transition function for the set of minimal tree states, QTmin, has the minimal number of string states out of all transition functions in all equivalence classes.

Proof We prove this by contradiction. Suppose a transition function with fewer states exists in an equivalence class associated with QT ≠ QTmin. In that minimal transition function, we can replace each tree state of QT by the tree state from Qmin that is associated with the equivalence class (in the partition of tree states defined by the equivalence relation) of that state. By definition of equivalent tree states, this results in a tree automaton with the same minimal number of string states and a minimal number of tree states. This proves that a transition function with fewer states cannot exist. �



Figure 3.18: Dependency graphs for refinement operators: a) for FSAs and b) for FTAs. For tree automata, we indicate whether the operation splits the partition of tree states (PT) or the partition of string states (PS).

From this it follows that our minimal tree automaton is unique and is based on the equivalence partitions of both the tree states and the string states of the transition function.

To obtain these two minimal sets of states, we start with a single class of tree states and a single class of string states, and we split a class when there is evidence that it contains two non-equivalent states. According to Proposition 3.5, two tree states are equivalent if and only if they are input equivalent in the FSA and if they produce the same output. Violation of one of these conditions provides evidence that they are not equivalent.

Violation of the first condition of Proposition 3.5 leads to input evidence. Hence we use the Ri operator to split a class in the partition of tree states. For the FSA representing the transition function of the tree automaton though, the set of tree states is output alphabet as well as input alphabet. Hence, splitting a class in the partition of the input alphabet can generate new output evidence for the partition of the string states. On the other hand, application of operators Ro and Rt on the string states can result in new input evidence. Hence the three operators have to be applied until all evidence has disappeared.

Now, let us return to the other condition of Proposition 3.5. Two tree states are not equivalent when they produce different output. When such output evidence is present, the operator Ro (applied on the equivalence classes of tree states) can be used to split the classes. As the three other operations cannot create new output evidence for tree states, the processing of output evidence on tree states can be done in a first phase of the minimization algorithm. In Figure 3.18, an arrow from one operator to another indicates that the first operator can provide evidence to use the second operator. An overview of these dependencies is given for both string automata minimization and tree automata minimization.

All these considerations lead to the algorithm depicted in Algorithm 3.4. The number of steps in this algorithm is bounded by the sum of the number of states in QS and QT. Hence the algorithm runs, similar to string minimization algorithms, in polynomial time.
The function α of the minimized automaton maps the input symbols on the equivalence classes of the original initial states. Merged initial states imply that their


Algorithm 3.4 General Automaton Minimization
1: PT := {QT}; PS := {QS}
2: while output evidence in PT do PT := Ro(PT, q)
3: while output or transition evidence in PS or input evidence in PT do
4: select evidence and apply corresponding operator:
5: PS := Ro(PS, q) or PS := Rt(PS, q, e) or PT := Ri(PT, e, q)
6: end while
7: create new automaton with PT and PS.

associated input symbols are equivalent. Hence, in our representation, the input minimization of the tree automaton comes for free.
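The joint refinement of the two partitions can be sketched in Python as follows. This is a simplified, illustrative version, not the thesis implementation: it assumes a complete transition function (so the implicit dead string state is not modelled) and refines by recomputing whole signatures rather than applying the Ro, Rt, and Ri operators one split at a time.

```python
def minimize_states(QS, QT, delta_s, phi_s, phi_t):
    """Return the equivalence partitions of string and tree states
    as dicts mapping each state to a class id."""
    def refine(states, sig):             # group states by their signature
        groups = {}
        for q in sorted(states):
            groups.setdefault(sig(q), []).append(q)
        return {q: i for i, g in enumerate(groups.values()) for q in g}

    def blocks(part):                    # a partition as a set of blocks
        inv = {}
        for q, c in part.items():
            inv.setdefault(c, set()).add(q)
        return {frozenset(b) for b in inv.values()}

    pt = refine(QT, lambda t: phi_t[t])  # phase 1: output evidence on PT (Ro)
    ps = {q: 0 for q in QS}              # start from a single string class
    while True:
        # Ro/Rt on PS: same output class, equivalent successor per input
        new_ps = refine(QS, lambda q: (pt[phi_s[q]],) +
                        tuple(ps[delta_s[(q, t)]] for t in sorted(QT)))
        # Ri on PT: equivalent as input symbols of the transition function
        new_pt = refine(QT, lambda t: (pt[t],) +
                        tuple(ps[delta_s[(q, t)]] for q in sorted(QS)))
        if blocks(new_ps) == blocks(ps) and blocks(new_pt) == blocks(pt):
            return new_ps, new_pt
        ps, pt = new_ps, new_pt

# Toy automaton accepting only the leaf 'a', with a redundant tree state
# 'N2' (equivalent to 'N') and a redundant string state 's2' (equiv. to 's1').
QT = ['Y', 'N', 'N2']
QS = ['s0', 's1', 's2']
phi_t = {'Y': True, 'N': False, 'N2': False}
phi_s = {'s0': 'Y', 's1': 'N', 's2': 'N2'}
delta_s = {('s0', 'Y'): 's1', ('s0', 'N'): 's1', ('s0', 'N2'): 's2'}
for t in QT:                             # make the function complete
    delta_s.setdefault(('s1', t), 's1')
    delta_s.setdefault(('s2', t), 's2')
ps_min, pt_min = minimize_states(QS, QT, delta_s, phi_s, phi_t)
```

On this toy input the loop merges N with N2 and s1 with s2, leaving two classes of each kind.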

Example 3.16 Figure 3.19 shows a possible run of Algorithm 3.4 on the FTA from Figure 3.17.e. Executing Line 2, output evidence is found for state 1 (the output of state 1 ∈ [1] is reject, while states in the same class exist having accept as output: states 6, 7, and 8). After refining PT with the Ro operator, no more output evidence is found in PT, and the algorithm continues to Line 3. Output evidence is found in PS (s1 returns a state from [0], while states in the same class return states from [6]), which leads to a refinement of PS with the Ro operator. Splitting PS leads to transition evidence (s1 ∈ [s1] goes with 1 to [s5] = [s1], while, for example, s5 ∈ [s1] leads with 1 to [s11]), provoking an Rt operation. Input evidence (s6 leads with 1 ∈ [1] to [s0], while s6 leads with 4 ∈ [1] to [s12]) calls for refinement of PT with the Ri operator. The order of the last two operations could have been reversed; no order on picking evidence is prescribed. The final partitions, shown in Figure 3.19, are isomorphic with the sets of states of the FTA in Figure 3.17.c.

3.3.5.4 Comparison and Experimental Evaluation

We have seen that every automaton that uses the second alternative representation or the STA representation can also be considered to be in our representation. Hence the minimal automata that belong to these classes cannot be smaller than the minimal automaton computed by our algorithm. As the example of Figure 3.17

(Figure: successive partitions of the tree states 0 . . . 8 and the string states s0 . . . s14, refined by Ro(PT, 1), Ro(PS, s1), Ri(PT, s6, 1), Rt(PS, s1, 1), and further operations.)

Figure 3.19: A possible run of Algorithm 3.4 on the FTA from Figure 3.17.e


Figure 3.20: On the left, a minimal automaton in both our representation and the second alternative representation. On the right, an equivalent automaton, minimal in the STA representation.

shows, they can be larger. The minimal automaton with the second alternative representation can be smaller as well as larger than the minimal STA automaton. The proof in (Martens and Niehren 2006) stating that the STA is always smaller only proves that its number of string states is smaller than or equal to that of the minimal automaton for the second alternative representation. A counterexample, in which the second alternative representation yields an equal number of string states but fewer tree states, is the automaton that accepts all trees constructed with a single symbol 'a', in which each node is either a leaf or has two children. This automaton is shown in the two representations in Figure 3.20.

The minimal automaton in the first alternative representation can be larger than in our representation (as in the example of Figure 3.17), but can also be smaller (see the example with disjoint acceptors in (Martens and Niehren 2006)). Recall that it is not computable in polynomial time. Note also that, even though a minimal set of tree states is also defined in the first alternative representation, the minimal equivalent automaton for this representation (based on the sum of the number of tree states and the number of string states) might have more tree states than this minimal set.

We deem not really valid the argument that for an STA, where each string state is also a tree state, the total number of states should be halved, resulting in the smallest possible number of states. For the representation of the tree automata, this implies that less space is used, because the output function for the string states uses no space. But this gain in space is lost again, because the STA representation results in extra transitions. More important though is that, after all, space constraints are not the driving motivation for the minimization of tree automata, as they are for string automata. We are not aware of applications of tree automata needing an implementation in electronics, and for use on a PC, memory constraints are not that big a concern.

In applications like automata induction, the speed of operations on the automata is more important. Having minimal tree automata as input speeds up the operation itself, and keeps the resulting tree automata small (less time is needed for the minimization of the result). As an example, we point to the operator described in Section 5.4.2.2. The composite representation for string states for that operator is made up of both original string and original tree states. Having a minimal set of tree states definitely results in equivalent composite string states being merged together during the operation itself instead of during a minimization


afterwards. Hence, because the string states and the tree states are used in different ways,

these different uses benefit when for each type the minimal set is used.
The system for wrapper induction, described in Section 6.4.2, uses internally

various tree automata operations. After every operation we perform a minimization of the result. Running some induction tasks resulted in a set of automata, with associated minimizations. In a sample of 894 automata, we have 50 automata of size greater than 100. The total size of the whole set was 43025 (10815 tree states and 32210 string states). Using our representation, the total size of the minimal automata is 17106 (5384 tree states and 11722 string states), a reduction to 39% (resp. 49% and 36%). Using the second alternative representation, the minimal representations have size 17507 (5384 tree states and 12123 string states), a reduction to 40% (resp. 49% and 37%). With the STA representation, the reduction is much smaller: we obtain (counting all dead states in an automaton as a single one) size 20532 (8810 tree states and 11722 string states), a reduction to 47% (resp. 81% and 36%).

3.3.6 Determinization of Tree Automata

Like non-deterministic string automata, non-deterministic tree automata will have a transition relation instead of a transition function. Given our representation of the transition function, we can discern different ways for the transition function to be non-deterministic. Either the transition function δS leads to multiple different string states for a given tree state as input, or the function α leads to multiple initial states for a given input symbol, or, when a final string state is reached, the output function φS can return multiple different resulting tree states. For non-deterministic automata, we will represent these functions such that they return sets instead of single elements. The transition function δS returns sets of string states (δS : (QS × QT) → 2^QS), the function α returns sets of (initial) string states (α : Σi → 2^QS), and the output function φS returns sets of tree states (φS : QS → 2^QT).

For every non-deterministic tree automaton there also exists an equivalent deterministic tree automaton. We can prove this in the same way as for string automata, by providing an algorithm that constructs that equivalent automaton. We give this algorithm within the framework from Section 3.3.4, i.e., we propose a composite representation and provide the functions needed by the general construction algorithm.

We simulate a parallel run of the non-deterministic automaton in the composite representation, such that for each non-deterministic result, the different options are followed simultaneously. A composite tree state is therefore represented by a set of tree states, where each of these states refers to a possible non-deterministic outcome. Also the composite string states are sets of string states, where each


string state indicates a possible state the non-deterministic string automaton might reach for a given input sequence.

The function getInitialComposite(a, list) returns the set of initial states that is the result of the non-deterministic α function of the original automaton. The function getCompositeOutputT(cT, list) returns accept when at least one of the states in cT has accept as output. The function getCompositeOutputS(cS, list) returns a set containing the union of the non-deterministic outputs of each element of cS. The function getCompositeTransition(cS, cT, list) returns a set that contains the union of the transitions from each element of cS, given each tree state in cT. This function is shown in Algorithm 3.5.

Algorithm 3.5 Function getCompositeTransition for the determinization operator
Input: A composite string state cS, a composite tree state cT, and a list (Tnd) containing the original non-deterministic automaton.
Output: The next composite string state.
1: next = ∅
2: for all stateS ∈ cS do
3:     for all stateT ∈ cT do
4:         next = (next ∪ δSnd(stateS, stateT)) \ {nil}
5:     end for
6: end for
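Algorithm 3.5 can be transcribed into Python as follows. This is a sketch: the dictionary encoding of δSnd (mapping a pair of states to a set of successor states, with absent entries meaning "no transition") is an assumption, not the thesis's own data structure.

```python
def get_composite_transition(c_s, c_t, delta_s_nd):
    """Union of the non-deterministic transitions from every string state
    in c_s, for every tree state in c_t (Algorithm 3.5).
    delta_s_nd maps (string state, tree state) to a set of string states;
    a missing entry is treated as the empty set (i.e. nil)."""
    nxt = set()
    for state_s in c_s:
        for state_t in c_t:
            nxt |= delta_s_nd.get((state_s, state_t), set())
    return frozenset(nxt)

# Transitions of the automaton of Example 3.17: from state 1, given tree
# state C, the automaton can move to state 3 or 4; state 2 has no move.
delta = {(1, 'C'): {3, 4}}
get_composite_transition({1, 2}, {'C'}, delta)  # frozenset({3, 4})
```

The frozenset return value makes composite states hashable, so they can in turn serve as keys when the constructed deterministic automaton is stored in the same kind of dictionary.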

Example 3.17 The tree automaton shown in Figure 3.21.a is non-deterministic. For the symbol ‘a’, there exist two initial string states (non-deterministic α), and from state 1, given tree state C, transitions lead to either state 3 or state 4 (non-deterministic δS). The output function φS in this automaton is deterministic.

In Figure 3.21.b, the composite representation of the determinization of the automaton from Figure 3.21.a is shown. The initial state for the symbol ‘a’ is the set {1, 2}. Its output is the set containing the outputs of its elements: {A, B}. The transition from this state, given the composite tree state {C}, results in the state {3, 4}, which is the union of the transition from state 1, given C ({3, 4}), and the transition from state 2, given C ({}).

3.4 Summary

In this chapter, we started by discussing string automata and their properties. The main goal of this part was to provide a selection of string automata concepts from the literature, in order to refer to them when talking about their extensions for tree automata. Our contributions here are the definition of a general framework



Figure 3.21: A non-deterministic automaton (a), and the composite representation of its determinization (b).

for operators for string automata, and the section on input minimization, of which we could not find any mention in the literature.

The remainder of the chapter deals with tree automata. Our contribution is to present a new representation for the transition function of a tree automaton. For this representation we also define a general framework for operators, and we illustrate its use by defining simple operators such as copy, negation, union, and intersection. Furthermore, we show how to make an automaton deterministic, and how to find its minimal equivalent automaton. Concerning minimization, we show that there exists a unique minimal set of tree states for all equivalent tree automata, regardless of the representation of their transition function. With regard to our new representation, we proved that for every automaton, an equivalent automaton exists with a minimal set of tree states, and a set of string states for its transition function that is minimal over all equivalent automata. Additionally, we provided an algorithm to find this minimal equivalent automaton in polynomial time.

We also discussed alternative representations for the transition function, found in the literature, and compared them to our own representation. All representations are equally expressive, but have different qualities. A first alternative we discussed (Kosala et al. 2003; Neven 2002) has, in contrast to the other representations discussed (including ours), the disadvantage that its minimal automaton exists but is not unique, while minimization itself is NP-complete. Furthermore, running a tree automaton in this representation is less efficient. The other two alternatives we discussed can be considered as subsets of our representation, with different constraints on the subsets. One representation (Cristau et al. 2005; Raeymaekers and Bruynooghe 2004) defines those tree automata in our representation in which the transition function for an automaton has disjoint sets of string states, each


set containing the states reached for a given input symbol. Another representation, STA (Carme et al. 2004), defines the subset of automata in which the output of each string state of the transition function results in a different tree state. Because these other two representations define subsets of the automata in our representation, the minimal automaton in our representation can be smaller than in the other ones.


Chapter 4

Information Extraction with Automata

In this chapter we present our approach to using automata for information extraction. We define marked documents, and documents that are marked correctly with regard to an extraction task. Note that we will use the more generic term document when we discuss properties that hold for both strings and trees. We discuss automata that accept only correctly marked versions of documents. Operations to convert between different types of these automata are described, with combining wrappers as a practical application. We conclude with an optimized method to perform extraction using automata.

4.1 Marked Documents

Let M be a set of markers, where a marker can be an arbitrary symbol. The marked alphabet ΣM is defined as ΣM = Σ ∪ {aX | a ∈ Σ, X ∈ M}. We call a symbol a ∈ Σ an unmarked symbol, and aX ∈ ΣM \ Σ a marked symbol. When M is a singleton containing a single marker X, we will use ΣX as a shorthand notation for ΣM. A marked sequence or string is defined as an element of ΣM∗, and a marked tree is an element of T(ΣM).

A marking of a document d is a marked version d′ of that document, in which some of the elements e are replaced by a marked version eX. To give a more formal definition of a marking, we first introduce an overloaded function strip that allows us to obtain an unmarked version of a symbol, string or tree.

Definition 4.1 (strip function) The function strip : ΣM → Σ is defined as:
strip(eX) = e
strip(e) = e


with e ∈ Σ and X ∈ M .

The function strip : ΣM∗ → Σ∗ is defined as:
strip(es) = strip(e)strip(s)
strip(ε) = ε

with e ∈ ΣM and s ∈ Σ∗M .

The function strip : T(ΣM) → T(Σ) is defined as:
strip(f(s)) = strip(f)(strip(s))

with f ∈ ΣM, s ∈ T(ΣM)∗, and the function strip : T(ΣM)∗ → T(Σ)∗ defined as:
strip(es) = strip(e)strip(s)
strip(ε) = ε

with e ∈ T (ΣM ) and s ∈ T (ΣM )∗.
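The three overloads of strip can be sketched in Python. The encodings are assumptions made for this illustration: a marked symbol is a (symbol, marker) pair, a string is a list of symbols, and a tree is a (label, children) pair.

```python
def strip_symbol(sym):
    """strip on symbols: a marked symbol (e, X) becomes e; an unmarked
    symbol is returned unchanged."""
    return sym[0] if isinstance(sym, tuple) else sym

def strip_string(s):
    """strip on strings: apply strip to every symbol; the empty string
    maps to itself."""
    return [strip_symbol(e) for e in s]

def strip_tree(t):
    """strip on trees: strip the label and recurse on the children."""
    label, children = t
    return (strip_symbol(label), [strip_tree(c) for c in children])
```

For example, strip_tree applied to the tree a(@X, a) (encoded as `('a', [(('@', 'X'), []), ('a', [])])`) yields the unmarked tree a(@, a).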

A marking of a given document can then be defined as any marked document for which the unmarked version equals the original document. For the cases where definitions for strings and trees are similar, we introduce the notation doc(Σ), meaning either Σ∗ or T(Σ).

Definition 4.2 (Marking of a document) d′ ∈ doc(ΣM) is a marking of a document d ∈ doc(Σ) ⇐⇒ strip(d′) = d.

The set of all markings of an unmarked symbol s or document d is defined as respectively markM(s) = {s′ ∈ ΣM | strip(s′) = s} and markM(d) = {d′ ∈ doc(ΣM) | strip(d′) = d}. Note that d ∈ markM(d): we call d the empty marking of d. We define the set of positions that are marked in a given marking as:

Definition 4.3 (Marker positions of a marking) Given d′ a marking of a document d:

PM(d′) = {p ∈ P(d) | d′ ↓ p = (d ↓ p)X, with X ∈ M}.

A marking d′ is a submarking of a marking d′′ if and only if the same nodes that are marked in d′ are also marked in d′′, with the same marker. Marked nodes in d′′ are not necessarily marked in d′. Or, more formally:

Definition 4.4 (Submarking of a marking) Given a document d and marked documents d′, d′′ ∈ markM(d):

d′ is a submarking of d′′ ⇐⇒ PM(d′) ⊆ PM(d′′) and ∀p ∈ PM(d′) : d′ ↓ p = d′′ ↓ p.
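Definitions 4.2 through 4.4 can be illustrated for strings with a short sketch. The encoding (marked symbols as (symbol, marker) pairs) and all function names are assumptions of this illustration.

```python
from itertools import product

def mark_all(string, markers):
    """markM(d) for a string d: every position is either left unmarked or
    marked with one of the markers."""
    options = [[sym] + [(sym, x) for x in markers] for sym in string]
    return [list(m) for m in product(*options)]

def marked_positions(marking):
    """PM(d'): the positions that carry a marker."""
    return {p for p, sym in enumerate(marking) if isinstance(sym, tuple)}

def is_submarking(m1, m2):
    """Definition 4.4: every position marked in m1 is marked in m2 with
    the same marker (m2 may mark additional positions)."""
    return all(m2[p] == m1[p] for p in marked_positions(m1))

markings = mark_all(['a', 'b'], ['X'])
len(markings)  # (m + 1)^n = 2^2 = 4 markings, including the empty marking
```

Note that the empty marking (the unchanged string) is a submarking of every marking of that string, since `marked_positions` of the empty marking is the empty set.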

We define a function split that disassembles a marking into single marker markings that together mark the same elements as the original marking. This means that for every element marked in the original marking, and only for those, a marking is returned which has that specific element marked, and no other elements.

Definition 4.5 (Split of a marking) Given a marking m ∈ doc(ΣM):
split(m) = {m′ ∈ doc(ΣM) | m′ is a submarking of m and PM(m′) is a singleton}.
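For strings, split can be sketched as follows, again under the assumption that a marked symbol is encoded as a (symbol, marker) pair.

```python
def split(marking):
    """Split of a string marking: one single-marker marking per marked
    position, with all other positions stripped."""
    result = []
    for pos, sym in enumerate(marking):
        if isinstance(sym, tuple):  # this position is marked
            single = [s[0] if isinstance(s, tuple) else s for s in marking]
            single[pos] = sym       # keep only this one marker
            result.append(single)
    return result

# A marking with two marked positions splits into two single-marker markings.
split(['a', ('b', 'X'), ('c', 'X')])
# -> [['a', ('b', 'X'), 'c'], ['a', 'b', ('c', 'X')]]
```

Applied to an empty marking the loop finds no marked position, so the result is the empty list, matching the remark below that split(d) = ∅ for d ∈ doc(Σ).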


Note that the split of an empty marking d ∈ doc(Σ) is empty: split(d) = ∅, and that PM(m) = ⋃_{m′ ∈ split(m)} PM(m′).

A string or tree automaton D is a marked string or tree acceptor when Σi = ΣM and Σo = {accept, reject}. A single marking acceptor is a marked document acceptor that accepts exactly one marking per document (possibly the empty marking). More formally:

Definition 4.6 (Single marking acceptor) An acceptor D is a single marking acceptor ⇐⇒ ∀d ∈ doc(Σ), ∃dm : D accepts dm, d = strip(dm), and ∀d′ ∈ (markM(d) \ {dm}) : D rejects d′.

4.2 Representing Wrappers

To be able to represent wrappers as automata over marked documents, we define a connection between a marking of a document and the extractions from a document: correct markings.

4.2.1 Correct Markings

An extraction task is defined on a domain of documents. A single field extraction task extracts zero or more elements from such a document. We formally define an extraction task E as a mapping from a tree to the set of positions of the target elements, E : t → 2^P(t).

Consider the set of all markings, with a single marker X, of a document from the domain of that task. A marking is correct with regard to that extraction task when each marked element is an element which should be extracted.

Definition 4.7 (Correct marking) A marking t′ is a correct marking of t, with regard to an extraction task E, if and only if t ∈ Dom E ⇒ PX(t′) ⊆ E(t).

If we refer simply to a correct marking, it should be clear from the context with regard to which extraction task that marking is correct. If a given marking is correct, it follows that every one of its submarkings is also a correct marking, as the marked elements they have in common are marked in the given marking and should by definition be extracted.

A correct marking is complete when every element that should be extracted is indeed marked.

Definition 4.8 (Complete correct marking) A marking t′ is a complete correct marking of t, with regard to an extraction task E, if and only if t ∈ Dom E ⇒ PX(t′) = E(t).

It is clear that while for every document multiple correct markings exist, there exists exactly one complete correct marking per document, and every other correct


marking for a document is a submarking of the complete correct marking of that document.

Marked document acceptors that accept only correct markings for a given task are called correct marking acceptors for that task. Not every correct marking acceptor accepts the same set of correct markings. We pick out and name some of these sets. We denote with CCM the set of Complete Correct Markings. For all the extraction tasks we consider in this thesis, we assume that the CCM acceptor can be represented with a regular marked tree acceptor. So, for example, an extraction task in which a node is extracted when its position in the list of children of its parent is a prime number, is outside our scope.

The set of all correct markings (always with regard to a given extraction task) is denoted as the Partial Correct Markings (PCM). The term ‘partial’ is used in contrast with ‘complete’, to indicate that not only complete correct markings are included. Those correct markings in which only one single element is marked are called Single Correct Marker markings (SCM). Note that there exist as many single correct marker markings for a given document as there are elements marked in the complete correct marking. An empty marking is also a correct marking, because it respects the requirements, specified in the definition, on all ‘marked’ elements. For some documents, the empty marking might even be the complete correct marking. The empty markings for each document from the domain of the extraction task are included in the set of Empty Markings (EM). The set of correct markings containing all empty markings and all single correct marker markings is called ESCM. All the other sets of correct markings are not explicitly named. Clearly every set of correct markings is a subset of PCM.

For a marking m, it holds that PM(m) = ⋃_{m′ ∈ split(m)} PM(m′). This implies that a marking is correct if and only if each element of split(m) is a single marker correct marking (we already know that strip(m), the empty marking, is always correct).

Example 4.1 Given an extraction task with as domain a set of strings in which each string is constructed as a concatenation of substrings. Each substring is either ‘@@1’ or ‘@@2’. The extraction task requires that in parts of the string ending in ‘1’, the first ‘@’ is extracted, and in parts of the string ending in ‘2’, the second ‘@’ is extracted. An example from these strings is ‘@@1@@2@@2’. This string is also an empty marking. The complete correct marking of this string, with regard to this extraction task, is ‘@X@1@@X2@@X2’. An example of a single correct marker marking of this string is ‘@@1@@X2@@2’. In Figure 4.1, acceptors are shown that accept respectively the sets PCM, CCM, EM, SCM, and ESCM. The PCM acceptor, for example, accepts the three example markings from above, while the CCM acceptor accepts only the complete correct marking.

Example 4.2 From the domain of all trees over the alphabet {a, @} we want to extract those elements and only those that are labeled ‘@’ and have as grandparent


[Figure 4.1 shows five acceptors, one per panel: PCM, CCM, EM, SCM, and ESCM.]

Figure 4.1: Examples of different correct marking acceptors for the extraction task from Example 4.1. Some of the acceptors have their states labeled, such that they can be referred to later on.

a node labeled ‘a’. Hence nodes that are the root or that have the root as parent cannot be extracted, as they don’t have a grandparent. Acceptors for this extraction task, accepting respectively the sets PCM, CCM, EM, SCM, and ESCM, are shown in Figure 4.2. Note that the tree a(@(@, @), a) is in the domain. It is accepted as an empty marking by the EM acceptor, and also by the ESCM and PCM acceptors. The complete correct marking for this tree, with regard to the above extraction task, is the marked tree a(@(@X, @X), a). This marking is accepted by both the PCM and CCM


acceptors. The marking a(@(@X, @), a) is correct and is therefore accepted by the PCM acceptor. It is not complete, hence rejected by the CCM acceptor. As there is only a single marker in the marking, it is accepted by the ESCM and SCM acceptors.

On the documents outside the domain of the extraction task, the correctness of a marking is not defined. Therefore there are multiple possible (non-equivalent) acceptors for the same set of correct markings. An acceptor that accepts CCM could reject all markings for documents outside the domain, or accept only empty markings for these documents, or accept all possible markings, or any subset of these markings, for documents outside the domain. This freedom can be used to pursue more robust acceptors in case the domain of the task changes. Furthermore, when the extraction task is learned, the domain is not always well-known, or well-defined. Therefore it is interesting to try to extrapolate the rules in the acceptor to operate outside the domain.

Example 4.3 The EM acceptor that is shown in Figure 4.1 for the extraction task from Example 4.1 accepts exactly the domain of the extraction task. All strings outside the domain are rejected. The EM acceptor shown in Figure 4.3 does accept all empty markings of the strings in the domain, hence it is an EM acceptor, but it also accepts all other strings that can be formed with the same alphabet.

The CCM acceptor from Figure 4.1 rejects all markings from strings outside the domain. In Figure 4.3 we show an example of a CCM acceptor that does accept some markings from outside the domain. For strings from the domain, the results of both acceptors are of course the same. Note that the CCM acceptor from Figure 4.3 is not a single marking acceptor, as it accepts both markings ‘@X@@X1’ and ‘@@X@X1’ for the string ‘@@@1’ from outside the domain.

4.2.2 Extraction

We now show that we can use some correct marking acceptors to extract the target elements for the associated extraction task. One can feed all possible markings of a document to a CCM acceptor. The single marking that gets accepted is a complete correct marking, and therefore yields all the extractions. This shows that CCM acceptors can be used to represent a wrapper for that extraction task. In practice though, this is not a feasible approach, since the number of marked versions of a document equals (m + 1)^n, with m the number of different markers and n the number of elements in the document.

Another approach is to iterate over all possible Single Correct Marker markings of a document. When such a marking is accepted by an SCM or PCM acceptor, the


[Figure 4.2 shows five acceptors, one per panel: PCM, CCM, EM, SCM, and ESCM.]

Figure 4.2: Examples of different correct marking acceptors for the extraction task from Example 4.2. Some of the acceptors have their states labeled, such that they can be referred to later on.


[Figure 4.3 shows two acceptors: EM and CCM.]

Figure 4.3: Examples of EM and CCM acceptors for the extraction task from Example 4.1 that do accept markings of strings outside the domain.

element marked in that specific marking is added to the collection of extractions. This approach is practically applicable as it requires only m × n runs over the document. Hence it is practically feasible to represent a wrapper as an SCM or PCM acceptor. In Section 4.4 we propose yet another approach, based on a CCM acceptor, necessitating only a single run over the document to collect the extractions. Hence also a CCM acceptor can be used in a practical way to represent wrappers.
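The m × n extraction loop over single-marker markings can be sketched as follows for strings. The acceptor is treated as a black-box predicate, and the toy acceptor used in the demonstration is an invented stand-in, not one of the acceptors from the figures.

```python
def extract(doc, markers, accepts):
    """Try every single-marker marking of the document (m * n runs of the
    acceptor); a position is extracted when the acceptor (an SCM or PCM
    acceptor, given here as the predicate `accepts`) accepts the marking.
    Marked symbols are (symbol, marker) pairs, an encoding chosen for
    this sketch."""
    extractions = []
    for pos, sym in enumerate(doc):
        for x in markers:
            marking = list(doc)
            marking[pos] = (sym, x)   # mark exactly one element
            if accepts(marking):
                extractions.append((pos, x))
    return extractions

# Toy acceptor for the task of Example 4.1, restricted to the single part
# '@@1': only the marking with the first '@' marked is correct.
accepts = lambda m: m == [('@', 'X'), '@', '1']
extract(['@', '@', '1'], ['X'], accepts)  # [(0, 'X')]
```

Each iteration runs the acceptor once over the whole document, which is where the m × n cost comes from.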

As the acceptor resulting from some wrapper induction approach is not necessarily in the form that allows for the most efficient extraction, there is a need for automata operations converting between the different forms. Also for combining two wrappers, these automata have to be in a suitable form. Below we define the necessary operators, such that they can be executed with the general construction algorithm from Chapter 3.

4.2.3 Conversion from CCM to PCM

The CCM acceptor for a given extraction task accepts a single correct marking for every document in the domain. From a complete correct marking of a given document we can derive the partial correct markings of that document by removing zero or more markers. These markings will not be accepted by the CCM acceptor (except when zero markers are removed). We explain this conversion procedure first for string automata, and later we explain the extension to tree automata.

We can use the CCM string acceptor to accept a partial correct marking if we choose, for each element with the marker stripped off, to follow the original transition for the marked symbol instead of the transition for the unmarked symbol. But when we process an arbitrary marking, we do not know which unmarked symbols


might be stripped and which might not be stripped. Hence for every unmarked symbol in the sequence we can, from the state reached after processing the sequence up to that element, either follow the transition with that unmarked symbol (if it exists), or follow each other transition leaving that state with that symbol marked with some marker (the set of markers M can contain multiple markers). This means that the transition for an unmarked symbol is non-deterministic. Based on the transition function δ from the CCM acceptor we can define a non-deterministic function δ′ as δ′(q, a) = {δ(q, a)} ∪ {δ(q, a′) | a′ ∈ ΣiM and strip(a′) = a}, with q ∈ Q and a ∈ Σi, where Q and Σi are respectively the set of states and the input alphabet of the CCM acceptor. The non-deterministic automaton, resulting from replacing the transition function δ in the CCM acceptor with δ′, will accept the set PCM for the same extraction task. Note that for marked symbols, δ′ results in a singleton: δ′(q, aX) = {δ(q, aX)}. Hence δ′ is only non-deterministic for unmarked symbols. A PCM acceptor is obtained by performing the determinization operator from Section 3.2.4 on the non-deterministic automaton based on δ′.

We define a conversion operator directly from a CCM acceptor to a PCM acceptor, called CP. Instead of having an intermediate non-deterministic automaton, we incorporate δ′ inside the function getCompositeTransition(c, a, list) of the determinization operator. The function getCompositeTransition for the CP operator is shown in Algorithm 4.1. The functions getInitialComposite(list) and getCompositeOutput(c, list) stay the same as in the determinization operator.

Algorithm 4.1 Function getCompositeTransition for the CP operator (string)
Input: A composite state c, an input symbol a, and a list (ACCM) containing the CCM acceptor.
Output: The next composite state.
1: next = ∅
2: for all state ∈ c do
3:     next = (next ∪ δCCM(state, a)) \ {nil}
4:     if unmarked(a) then
5:         for all marker ∈ M do
6:             a′ = a_marker
7:             next = (next ∪ δCCM(state, a′)) \ {nil}
8:         end for
9:     end if
10: end for

Example 4.4 For the extraction task from Example 4.1, we have applied the CP operation on the CCM acceptor shown in Figure 4.1. The result of this operation is shown in Figure 4.4.

For the conversion of a CCM tree acceptor to a PCM tree acceptor we proceed in a


Figure 4.4: The composite representation of the CCM string acceptor from Figure 4.1 converted to a PCM acceptor.

similar fashion as for string acceptors. In a first step we design a non-deterministic automaton accepting the PCM set, to illustrate the principle behind the conversion, and in a second step we define a conversion operator performing the conversion directly from a CCM tree acceptor.

In a tree acceptor, the function α results in an initial string state for an unmarked symbol. To allow for the possibility that the unmarked symbol might be a stripped version of that symbol in some complete correct marking, the non-deterministic function α′ should map that unmarked symbol also to the initial string states associated by α to the marked versions of that symbol. The function α′ is therefore defined as α′(a) = {α(a)} ∪ {α(a′) | a′ ∈ ΣM and strip(a′) = a}. The transition function δS and the output function φS stay unchanged (deterministic).

The CP operator for tree automata is defined to perform the conversion from a CCM tree acceptor for a given extraction task to a PCM tree acceptor for the same extraction task. This operator is defined in the same way as the determinization operator from Section 3.3.6, except for the function getInitialComposite(a, list). The replacement getInitialComposite function for the CP operator is shown in Algorithm 4.2. In this function, the non-deterministic outcome of α′ is calculated directly from the function α of the CCM acceptor, which is the argument of the CP operator.

Example 4.5 For the extraction task from Example 4.2, we have applied the CP operation on the CCM acceptor shown in Figure 4.2. The result of this operation is shown in Figure 4.5. For the symbol ‘@’, the initial string state consists of the set of string states {2, 3}. The output of this state is a set of tree states {B, C}. From this initial state, the composite tree state {C} leads to the composite string state {7, 9}, while from the composite tree state {B, C}, a transition leads to a composite string state with even more elements: {6, 7, 8, 9}. The automaton shown in Figure 4.5 is equivalent to the PCM acceptor from Figure 4.2. Only after a minimization operation do we get the same automaton as in Figure 4.2.


Algorithm 4.2 Function getInitialComposite for the CP operator (tree)
Input: An input symbol a and a list (TCCM) containing the CCM acceptor.
Output: The initial composite state.
1: initial = ∅
2: initial = (initial ∪ αCCM(a)) \ {nil}
3: if unmarked(a) then
4:     for all marker ∈ M do
5:         a′ = a_marker
6:         initial = (initial ∪ αCCM(a′)) \ {nil}
7:     end for
8: end if

4.2.4 Conversion from PCM to CCM

We start by describing a method to use a PCM string acceptor to accept complete correct markings only. Then we define a composite representation, together with the necessary functions to simulate this method. This then immediately defines the conversion operator, given the framework from Section 3.2.2.

A PCM acceptor for a given extraction task accepts all possible correct markings of a document in the domain. We now want to accept only the complete correct marking. For a given marking accepted by the PCM acceptor, we can iterate over all the other markings of which that marking is a submarking, and check them with the PCM acceptor. If none of these markings is accepted by the PCM acceptor, that marking is the complete correct marking. Otherwise one of the accepted markings is the complete correct marking, and the original marking should be rejected.

Practically, there is no need to iterate over all these ‘supermarkings’. We only need to keep track of the set of states these alternative markings would end up in (next to the state that the original marking ends up in). If an intermediate state is reached, and the next symbol of the string is unmarked, we will not only follow the transition with the unmarked symbol, but also the ones which have that symbol marked (if any). This way the alternative markings (supermarkings) of our original marking which have that specific symbol marked are taken into account. When the final state is reached, we can discern three situations. In the first situation, the final state is a rejecting state, and all the final states for the alternative markings are also rejecting states. In this case it is clear that the original marking is not even a correct marking, and it is therefore rejected. In the second situation, the final state is an accepting state, and none of the final states for the alternative markings is an accepting state. Hence the original marking is the complete correct marking, as no correct supermarking could be found. In the third situation, the final state is an accepting state, and some of the final states for the alternative markings are accepting states. This means that even though the original marking is correct,


Figure 4.5: The composite representation of the CCM tree acceptor from Figure 4.2 converted to a PCM acceptor.

more complete correct supermarkings are found, and hence the original marking should be rejected. A fourth combination, namely a final state that is a rejecting state, and some accepting final states for the alternative markings, is impossible: if some supermarking is correct, then the original marking should be correct by definition.

A composite representation to simulate this behavior consists of string states represented by a pair of an ‘original state’ and a set of ‘alternative states’. The ‘original state’ is the state reached by the original marking, while the set of ‘alternative states’ consists of all the states reached by the alternative (super)markings. Before we go on to define the functions needed in the framework defined in Section 3.2.2, a remark concerning optimization: when the original state is in the set of alternative states, only two situations are possible. Either the original state is an accepting state, and there exists an accepting state in the set of ‘alternative states’ (that same original state), or the original state is a rejecting state, and all alternative states are also rejecting states. In both situations the original marking cannot be the complete correct marking. Moreover, for every composite state reached from such a composite state, it will still hold that its original state is contained in its set of alternative states. Summarized, such a state is a rejecting state, and every state that can be reached from it will again be a rejecting state. Therefore such a state is a dead state, and all transitions leading to it can be removed.



We now give the functions defining the operator that converts a given PCM acceptor to a CCM acceptor. We call this operator the PC operator. The function getInitialComposite returns the pair consisting of the initial state of the PCM acceptor and an empty set. The composite state returned by the function getCompositeTransition has as original state the result of the transition from the original state of the starting composite state. Its set of alternative states is the union of the set of states reached by all alternative transitions from the original state of the starting composite state, with the set of states reached by all possible transitions (with the unmarked symbol and all possible markers) from the set of alternative states of the starting composite state. When the set of alternative states in the new composite state would contain its original state, however, the function returns nil. This function is shown in Algorithm 4.3. The getCompositeOutput function returns accept for a given composite state only when the original state of that state is accepting and all the alternative states are rejecting.

Example 4.6 We illustrate the conversion of the PCM acceptor shown in Figure 4.1, for the extraction task from Example 4.1, into a CCM acceptor. The composite representation of the resulting CCM acceptor is shown in Figure 4.6. The initial state is the composite state (1, {}). With symbol @, this state leads to the composite state (3, {2}). This state represents the states reached by an original sequence ‘@’, and by the alternative sequence ‘@X’. The composite state reached from there with symbol @ is (5, {4, 6}), representing the states reached by the original sequence ‘@@’, and the alternative sequences ‘@X@’ and ‘@@X’. From (5, {4, 6}), transitions with symbols 1 and 2 lead to composite state (1, {1}). As this is a dead state, these transitions are not added. State 1 in the PCM acceptor is an accepting state. Therefore (1, {1}) indicates that ‘@@1’ or ‘@@2’ are accepted as correct markings, but so are their respective supermarkings ‘@X@1’ and ‘@@X2’.

As (5, {4, 6}) is a rejecting state and does not lead to any accepting state, it is equivalent to the dead state, and will be removed in a minimization step that results in the CCM acceptor from Figure 4.1.

The principle for the conversion of a PCM tree acceptor is the same as for a PCM string acceptor. The alternative runs are inserted through the choice of the initial function. For an unmarked symbol, the composite initial string state associated with it has as original state the original initial string state associated with that symbol, and as alternative states the original initial states associated with the marked versions of that symbol. This ensures that runs on alternative markings of the tree (supermarkings of the original marking) are processed in parallel, next to the run on the original marking. This parallel run is also reflected in the output of the string states, the tree states. Each tree state consists of an original tree state that is the output of the run on the original marking, and a set of alternative tree states, being the results of the runs on the alternative markings.



Algorithm 4.3 Function getCompositeTransition for the PC operator (string)
Input: A composite state c, an input symbol a, and a list (APCM) containing the PCM acceptor.
Output: The next composite state.
 1: if δPCM(c.original, a) = nil then
 2:   next = nil
 3: else
 4:   next.alternative = ∅
 5:   next.original = δPCM(c.original, a)
 6:   if unmarked(a) then
 7:     for all marker ∈ M do
 8:       a′ = a^marker
 9:       next.alternative = (next.alternative ∪ δPCM(c.original, a′)) \ {nil}
10:     end for
11:   end if
12:   for all state ∈ c.alternative do
13:     next.alternative = (next.alternative ∪ δPCM(state, a)) \ {nil}
14:     if unmarked(a) then
15:       for all marker ∈ M do
16:         a′ = a^marker
17:         next.alternative = (next.alternative ∪ δPCM(state, a′)) \ {nil}
18:       end for
19:     end if
20:   end for
21:   if next.original ∈ next.alternative then
22:     next = nil
23:   end if
24: end if
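Algorithm 4.3 translates almost directly into executable form. The following Python sketch is illustrative rather than the thesis implementation: it assumes the PCM transition function is given as a dict mapping (state, symbol) pairs to states, that a marked symbol is modeled as a (symbol, marker) tuple, and it fills the table only with the transition fragment used in Example 4.6.

```python
# Illustrative sketch of Algorithm 4.3 (PC operator, string case).
# Assumptions: delta is a dict (state, symbol) -> state; a marked symbol
# is a (symbol, marker) tuple; a missing entry plays the role of nil.

def mark(a, m):
    return (a, m)

def is_unmarked(a):
    return not isinstance(a, tuple)

def get_composite_transition(c, a, delta, markers):
    original, alternatives = c
    nxt_orig = delta.get((original, a))
    if nxt_orig is None:
        return None                      # lines 1-2: no transition, nil
    nxt_alt = set()
    if is_unmarked(a):                   # lines 6-11: marked variants of a
        for m in markers:
            t = delta.get((original, mark(a, m)))
            if t is not None:
                nxt_alt.add(t)
    for q in alternatives:               # lines 12-20: advance alternatives
        symbols = [a] + ([mark(a, m) for m in markers] if is_unmarked(a) else [])
        for sym in symbols:
            t = delta.get((q, sym))
            if t is not None:
                nxt_alt.add(t)
    if nxt_orig in nxt_alt:              # lines 21-23: dead composite state
        return None
    return (nxt_orig, frozenset(nxt_alt))

# Transition fragment consistent with Example 4.6 (state numbers as there):
delta = {(1, '@'): 3, (1, ('@', 'X')): 2,
         (3, '@'): 5, (3, ('@', 'X')): 4, (2, '@'): 6}
c1 = get_composite_transition((1, frozenset()), '@', delta, {'X'})
c2 = get_composite_transition(c1, '@', delta, {'X'})
```

Starting from the initial composite state (1, {}), the two transitions reproduce the composite states (3, {2}) and (5, {4, 6}) from Example 4.6.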



Figure 4.6: The composite representation of the PCM string acceptor from Figure 4.1 converted to a CCM acceptor.

The output of a tree state is ‘accept’ if and only if the original tree state is an accepting state and all alternative tree states are rejecting. Tree states which have their original tree state contained within the alternative tree states are still dead states. When the original state of a composite string state is also one of its alternative states, it follows that its output (and the output of all string states reached from that string state) is a tree state with a set of alternative states that contains its original state. Hence such a composite string state returns as output a dead tree state, just like all string states reachable from it, and is therefore also a dead state, and transitions to it can be removed.

The function getInitialComposite(a, list) returns the pair with as first element α(a) and as second element the set of all α(aX) that are not nil, with X ∈ M. Algorithm 4.4 shows the function getCompositeTransition. The function getCompositeOutputS(cS, list) returns a composite tree state that has as original tree state the output of the original state of composite state cS. The set of alternative tree states of this state is the set of the outputs of the alternative states of cS. The function getCompositeOutputT returns accept for a given composite tree state only when the original tree state of that state is accepting and all the alternative tree states are rejecting.

Example 4.7 The automaton shown in Figure 4.7 is the composite representation of the result of the PC operator on the PCM acceptor shown in Figure 4.2. This result is a CCM acceptor for the extraction task from Example 4.2.

4.2.5 Conversion from SCM to PCM

The PCM acceptor also accepts empty markings. The SCM acceptor, though, does not necessarily have all the information about empty markings, for example when the empty marking is also a complete correct marking for some document. We will therefore start the conversion to a PCM acceptor from an ESCM acceptor.



Algorithm 4.4 Function getCompositeTransition for the PC operator (tree)
Input: A composite string state cS, a composite tree state cT, and a list (TPCM) containing the PCM acceptor.
Output: The next composite string state.
 1: if δPCM(cS.original, cT.original) = nil then
 2:   next = nil
 3: else
 4:   next.alternative = ∅
 5:   next.original = δPCM(cS.original, cT.original)
 6:   for all stateS ∈ cS.alternative do
 7:     next.alternative = (next.alternative ∪ δPCM(stateS, cT.original)) \ {nil}
 8:   end for
 9:   for all stateT ∈ cT.alternative do
10:     next.alternative = (next.alternative ∪ δPCM(cS.original, stateT)) \ {nil}
11:   end for
12:   for all stateS ∈ cS.alternative do
13:     for all stateT ∈ cT.alternative do
14:       next.alternative = (next.alternative ∪ δPCM(stateS, stateT)) \ {nil}
15:     end for
16:   end for
17:   if next.original ∈ next.alternative then
18:     next = nil
19:   end if
20: end if



Figure 4.7: The composite representation of the PCM tree acceptor from Figure 4.2 converted to a CCM acceptor.

This poses no extra problem, as the ESCM acceptor is easily constructed as the union of an EM acceptor and an SCM acceptor. To check whether a marking m is correct (accepted by a PCM acceptor), we can check whether strip(m) and each element of split(m) are accepted by the ESCM acceptor.

Consider a marking of a string m ∈ Σ∗M and a position p ∈ PM(m). To process m with the ESCM acceptor as if it were the element of split(m) for which the element at position p is marked, we process each symbol in m before that position as if it were unmarked. At position p we process the marked symbol itself, and past position p each symbol is again processed as if it were unmarked. For every element of split(m), the intermediate states reached before the marked element are the same as the intermediate states reached by strip(m).

When we run all the markings from split(m) in parallel on the ESCM acceptor, one state suffices to represent the state of the empty marking and those markings from split(m) for which the marked element is not yet reached. We call this state the ‘prefix state’. The states reached by those markings for which the processing has already passed the single marked element are kept in a separate set. We call this set the ‘suffix states’. As composite representation for a state, we use a pair with as first element the prefix state and as second element the set of suffix states. The original marking is correct if and only if the prefix state and each of the suffix states are accepting states. When one of these states is a dead state, it follows that the whole composite state is a dead state.



To process a symbol a from the original marking, the prefix state becomes the state reached with strip(a) from the previous prefix state. From each of the suffix states the transition with strip(a) is made, and the resulting states form the new set of suffix states. If the symbol a is marked, the transition for the marked symbol a is followed as well. This way the marking of split(m) that has the current element marked is processed. The resulting state is the state for that marking after its marked element; hence this state is also added to the set of suffix states.

These observations lead to the following implementation of an operator that converts an ESCM acceptor to a PCM acceptor. We call this operator the ESP operator. The initial composite state (getInitialComposite) has as prefix state the initial state of the ESCM acceptor, and an empty set as its set of suffix states. Each composite state is rejecting unless all its compound states are accepting (getCompositeOutput). The behavior of the function getCompositeTransition is described above, and its pseudo-code can be seen in Algorithm 4.5.

Algorithm 4.5 Function getCompositeTransition for the ESP operator (string)
Input: A composite state c, an input symbol a, and a list (AESCM) containing the ESCM acceptor.
Output: The next composite state.
 1: if δESCM(c.prefix, strip(a)) = nil then
 2:   next = nil
 3: else
 4:   next.suffix = ∅
 5:   next.prefix = δESCM(c.prefix, strip(a))
 6:   for all state ∈ c.suffix do
 7:     next.suffix = (next.suffix ∪ δESCM(state, strip(a))) \ {next.prefix}
 8:   end for
 9:   if marked(a) then
10:     next.suffix = (next.suffix ∪ δESCM(c.prefix, a)) \ {next.prefix}
11:   end if
12:   if nil ∈ next.suffix then
13:     next = nil
14:   end if
15: end if
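A corresponding Python sketch of Algorithm 4.5 follows, under the same assumptions as before (dict-based transitions, marked symbols as (symbol, marker) tuples). The small ESCM transition fragment used at the end is hypothetical and only serves to exercise the function.

```python
# Illustrative sketch of Algorithm 4.5 (ESP operator, string case).
# c is a pair (prefix_state, frozenset_of_suffix_states); None plays nil.

def strip(a):
    return a[0] if isinstance(a, tuple) else a

def is_marked(a):
    return isinstance(a, tuple)

def esp_transition(c, a, delta):
    prefix, suffixes = c
    nxt_prefix = delta.get((prefix, strip(a)))
    if nxt_prefix is None:
        return None                              # lines 1-2
    nxt_suffix = set()
    for q in suffixes:                           # lines 6-8
        nxt_suffix.add(delta.get((q, strip(a))))
    if is_marked(a):                             # lines 9-11
        nxt_suffix.add(delta.get((prefix, a)))
    if None in nxt_suffix:                       # lines 12-14: a run died
        return None
    nxt_suffix.discard(nxt_prefix)               # keep prefix out of suffixes
    return (nxt_prefix, frozenset(nxt_suffix))

# Hypothetical ESCM fragment: p0, p1, p2 process unmarked '@'s; s1 is the
# state reached after the single marked element '@X'.
delta = {('p0', '@'): 'p1', ('p1', '@'): 'p2', ('p1', ('@', 'X')): 's1'}
c1 = esp_transition(('p0', frozenset()), '@', delta)
c2 = esp_transition(c1, ('@', 'X'), delta)
```

After reading ‘@’ and then the marked symbol ‘@X’, the composite state holds the prefix state of the unmarked run plus one suffix state for the single-marking run.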

Adding the prefix state of a composite state to its set of suffix states does not change anything to the output of that composite state. The states reached from that composite state will all get their prefix state added to their suffix states, but their output will also remain unchanged. This implies that a composite state is equivalent to a version of itself with its prefix state added to or removed from its suffix states. In Algorithm 4.5, states are always created in a form without the prefix state contained in the suffix states. This way the resulting automaton will



stay smaller.

Example 4.8 We illustrate the ESP operator for string automata by applying it to the ESCM acceptor from Example 4.1. The composite representation of the result is shown in Figure 4.8. The constituent states refer to the representation of the ESCM acceptor in Figure 4.1. Note that the states (5, {4, 6}) and (5, {4, 6, 9}) are rejecting states, and no transitions lead from them to non-dead states. Hence these states are equivalent to dead states themselves and will be removed in a minimization step. This minimization step results in the PCM acceptor shown in Figure 4.1.

If the string ‘@@X’ is processed, the automaton in the composite representation ends in state (5, {6}), where state 5 is the state reached in the original ESCM acceptor for the string ‘@@’, and state 6 the state reached for the single marking string ‘@@X’.

Figure 4.8: The composite representation of the ESCM string acceptor from Figure 4.1 converted to a PCM acceptor.

For the conversion of tree acceptors the principle is the same, in that we process the unmarked version of the original tree along with all possible single marked constituents of the original tree. Both composite string and tree states keep track of a prefix state and a set of suffix states. The function getInitialComposite(a, list) returns a composite state with an empty suffix set for an unmarked symbol a, and with a suffix set containing the initial state associated (in the original ESCM acceptor) with the symbol a for a marked symbol a. The prefix state is always the initial state associated with the unmarked symbol. The function getCompositeOutputS(cS, list) returns a composite tree state that has the output of the prefix state of cS as prefix state, and the outputs of the suffix states as its set of suffix states. If either the prefix tree state or one of the suffix tree states is a dead state, the dead tree state is returned instead. The function getCompositeOutputT(cT, list) only returns accept when the prefix state of cT is an accepting state, just like all the suffix states of cT.



Regarding the function getCompositeTransition(cS, cT, list): the transition from a composite string state cS will lead, given a composite tree state cT, to a new composite string state that has as prefix state the state reached by an unmarked tree. This is the state reached from the prefix state of cS with the prefix state of cT. The set of suffix states of the new state is the union of the states reached from the old suffix states cS.suffix given the tree state of the unmarked tree cT.prefix, and the set of new suffix states reached from the prefix state cS.prefix given the tree states of the processed single marking trees cT.suffix. The pseudo-code for this function is listed in Algorithm 4.6.

Algorithm 4.6 Function getCompositeTransition for the ESP operator (tree)
Input: A composite string state cS, a composite tree state cT, and a list (TESCM) containing the ESCM acceptor.
Output: The next composite string state.
 1: if δESCM(cS.prefix, cT.prefix) = nil then
 2:   next = nil
 3: else
 4:   next.suffix = ∅
 5:   next.prefix = δESCM(cS.prefix, cT.prefix)
 6:   for all stateS ∈ cS.suffix do
 7:     next.suffix = (next.suffix ∪ δESCM(stateS, cT.prefix)) \ {next.prefix}
 8:   end for
 9:   for all stateT ∈ cT.suffix do
10:     next.suffix = (next.suffix ∪ δESCM(cS.prefix, stateT)) \ {next.prefix}
11:   end for
12:   if nil ∈ next.suffix then
13:     next = nil
14:   end if
15: end if

Example 4.9 The ESCM acceptor from Example 4.2, as shown in Figure 4.2, is converted with the ESP operator to create a PCM acceptor for the same extraction task. The composite representation of the result is shown in Figure 4.9. After minimization, this automaton reduces to the equivalent PCM acceptor shown in Figure 4.2.

Note that the ESP operator works with every correct marking acceptor that accepts at least the set ESCM. Hence the ESP operator applied to a PCM acceptor will still result in a PCM acceptor.



Figure 4.9: The composite representation of the ESCM tree acceptor from Figure 4.2 converted to a PCM acceptor.

4.3 Combining Wrappers

Given an extraction task that extracts all elements belonging to a field X, and an extraction task that extracts all elements belonging to a field Y, the combined extraction task will extract the elements for both tasks, while still indicating which element belongs to which field. Hence, a correct marking for this combined task marks all elements extracted by the first extraction task with X, and those extracted by the second with Y. Typically each element belongs to only one field, but for some extraction schemes elements might have multiple different markers. For now we will assume that there is no overlap between the two tasks. Later, in Section 4.3.1, we will discuss the other case. We will also make the assumption that both extraction tasks share the same domain. Otherwise, combining them



would not make much sense.

The set of single marker correct markings for the combined wrapper (CCM-X,Y) is the set of all markings that are either a single marker correct marking for the X extraction task (CCM-X) or a single marker correct marking for the Y extraction task (CCM-Y). Hence ESCM-X,Y = ESCM-X ∪ ESCM-Y. As ESCM-X ⊂ PCM-X and ESCM-Y ⊂ PCM-Y, we know that ESCM-X,Y = ESCM-X ∪ ESCM-Y ⊂ PCM-X ∪ PCM-Y. This is the requirement such that the ESP operator applied on the union of the PCM-X acceptor and the PCM-Y acceptor returns the PCM-X,Y acceptor. This leads to the schema shown in Figure 4.10 to combine two wrappers.

CCM-X  --CP-->  PCM-X  --\
                          >-->  PCM-X ∪ PCM-Y  --ESP-->  PCM-X,Y  --PC-->  CCM-X,Y
CCM-Y  --CP-->  PCM-Y  --/

Figure 4.10: The schema shows the operations that have to be performed to combine two wrappers in CCM format into a new wrapper in the same format. Note that the different acceptors in the schema are denoted by the set that they accept.
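The schema relies on taking the union of two acceptors. For deterministic string acceptors this union can be built with the standard product construction; the sketch below is a generic textbook version under assumed complete transition tables, not code from the thesis.

```python
# Product construction for the union of two complete DFAs (textbook sketch).
# A DFA is a triple (delta, start, accepting): dict, state, set of states.

def union_dfa(d1, d2, alphabet):
    delta1, s1, acc1 = d1
    delta2, s2, acc2 = d2
    start = (s1, s2)
    delta, accepting = {}, set()
    todo, seen = [start], {start}
    while todo:
        q = todo.pop()
        q1, q2 = q
        if q1 in acc1 or q2 in acc2:   # union: accept if either accepts
            accepting.add(q)
        for a in alphabet:
            nq = (delta1[(q1, a)], delta2[(q2, a)])
            delta[(q, a)] = nq
            if nq not in seen:
                seen.add(nq)
                todo.append(nq)
    return delta, start, accepting

def accepts(dfa, word):
    delta, q, acc = dfa
    for a in word:
        q = delta[(q, a)]
    return q in acc

# d1 accepts words ending in 'a'; d2 accepts words ending in 'b'.
d1 = ({(0, 'a'): 1, (0, 'b'): 0, (1, 'a'): 1, (1, 'b'): 0}, 0, {1})
d2 = ({(0, 'a'): 0, (0, 'b'): 1, (1, 'a'): 0, (1, 'b'): 1}, 0, {1})
u = union_dfa(d1, d2, 'ab')
```

For intersection (used in Section 4.3.1), the same construction applies with the acceptance condition changed from “either accepts” to “both accept”.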

Example 4.10 Consider the set of trees constructed over the alphabet {a, b, @}, for which the root of each tree is labeled with the symbol a, and each subtree under the root is labeled with the symbol b and has exactly two children, both leaves. This set is accepted by the EM acceptor depicted in Figure 4.11. On this domain we define two extraction tasks. The first task marks an element with an X if and only if its previous sibling is labeled with the symbol a. The second task marks an element labeled with the symbol b with a Y when its second child is labeled with the symbol @. The CCM acceptors for these tasks are also shown in Figure 4.11.

We can use the approach described above to combine these two CCM acceptors into a new CCM acceptor that marks elements with X or Y, as requested in the two extraction tasks. This new CCM acceptor is presented in Figure 4.12.




Figure 4.11: The EM acceptor for the domain of the extraction tasks of Example 4.10, and the CCM acceptors for each task.

4.3.1 Overlap between Extraction Tasks

Normally a single element has only one marker, as each element typically belongs to only one field. Consider, however, the following extraction task, in which we want to extract the subsequence of a sequence that is enclosed between round brackets. We could mark this subsequence by indicating its begin and end with the markers B and E, as in ‘@@@(@B@@@E)@@’. This extraction task is an example of overlap between extraction tasks. In the sequence ‘@@(@BE)@@@’, the begin and the end of the subsequence are the same element. Therefore this element is marked twice.

Figure 4.12: The CCM acceptor that combines the CCM acceptors from Figure 4.11.

To check whether two extraction tasks overlap, we take the PCM acceptors for the two tasks. We change the symbol used for the markers in both acceptors to the same symbol, and we take the intersection of the two. If no marked symbol exists in the resulting automaton, there is no overlap. Otherwise, if an element could be assigned to different fields, there would exist a single marker marking in which that element is marked that is accepted by both acceptors, and hence also by their intersection.
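The overlap test just described can be sketched as follows: build the product (intersection) of the two acceptors after the markers have been renamed to one common symbol, trim states that are not on a path from the start to an accepting state, and check whether any surviving transition reads a marked symbol. The acceptors below are hypothetical toy examples, not the thesis implementation.

```python
# Overlap test sketch: intersect two acceptors (markers already renamed to
# the common marker 'Z') and look for a marked symbol on a live path.
from collections import defaultdict

def overlap_exists(d1, d2, alphabet):
    delta1, s1, acc1 = d1
    delta2, s2, acc2 = d2
    start = (s1, s2)
    trans, acc = [], set()
    todo, seen = [start], {start}
    while todo:
        q = todo.pop()
        q1, q2 = q
        if q1 in acc1 and q2 in acc2:  # intersection: both must accept
            acc.add(q)
        for a in alphabet:
            n1, n2 = delta1.get((q1, a)), delta2.get((q2, a))
            if n1 is None or n2 is None:
                continue
            nq = (n1, n2)
            trans.append((q, a, nq))
            if nq not in seen:
                seen.add(nq)
                todo.append(nq)
    back = defaultdict(set)            # trim: keep only states co-reachable
    for q, a, nq in trans:             # from an accepting state
        back[nq].add(q)
    live, stack = set(acc), list(acc)
    while stack:
        for p in back[stack.pop()]:
            if p not in live:
                live.add(p)
                stack.append(p)
    return any(isinstance(a, tuple)    # marked symbol on a live transition?
               for q, a, nq in trans if q in live and nq in live)

SIGMA = ['@', ('@', 'Z')]
# d1 accepts '@@' and '(@Z)@'; d2 accepts '@@' and '@(@Z)': no overlap.
d1 = ({(0, ('@', 'Z')): 1, (1, '@'): 2, (0, '@'): 3, (3, '@'): 2}, 0, {2})
d2 = ({(0, '@'): 1, (1, '@'): 2, (1, ('@', 'Z')): 2}, 0, {2})
# d3 additionally accepts '(@Z)@', so it overlaps with d1 on that marking.
d3 = ({(0, '@'): 1, (0, ('@', 'Z')): 1, (1, '@'): 2,
       (1, ('@', 'Z')): 2}, 0, {2})
```

With d1 and d2, the only marking in the intersection is the unmarked ‘@@’, so no marked symbol survives; with d1 and d3, the shared marking ‘(@Z)@’ makes the test report an overlap.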

A solution for combining wrappers that overlap is to first create three new acceptors that accept disjoint sets: the set of doubly marked elements, the set of elements accepted solely by the first acceptor, and the set of elements accepted solely by the other acceptor. These acceptors can then use the markers ‘XY’, ‘X’, and ‘Y’ respectively, where the special marker ‘XY’ indicates the doubly marked elements. The combination of these three acceptors can then be performed as described earlier for disjoint wrappers. These acceptors are defined as follows:

• An acceptor ESCM-XY = ESCM-X ∩ ESCM-Y, which is the intersection specified in the previous paragraph, and that uses ‘XY’ as marker symbol.

• An acceptor ESCM-X′ = (ESCM-X \ ESCM-XY) ∪ EM, that uses ‘X’ as marker symbol.

• An acceptor ESCM-Y′ = (ESCM-Y \ ESCM-XY) ∪ EM, that uses ‘Y’ as marker symbol.

The subtraction and intersection operations are performed in the same fashion as in the previous paragraph: the different markers are first changed into one desired symbol, and then the operation is performed. In the case that the original wrappers are already combined wrappers (containing more than one marker symbol), the approach is similar, although a bit more intricate.

4.4 Extraction in a Single Run

To represent the set of extractions, each extraction is represented as an element together with the associated field, as our aim is to extract elements instead of only their values (see Section 2.2.5). In this section we use an element’s index in the string or tree as a reference to that element. In a practical implementation this can of course



be replaced with any valid type of reference. We denote the set of all possible extractions as X = P × M, with P the set of possible references and M the set of marker symbols. In this section we first show how to extend automata so that they also return a list of extractions, in addition to the indication of acceptance or rejection. Afterwards we show how to run such an automaton as if it runs on all possible markings of a document, finally returning the extractions associated with the completely correct marking of the document.

4.4.1 Keeping Track of Extractions

To keep track of the extractions made during the execution of a string marking acceptor, we redefine the string automaton. A new set of states is used, in which each state represents an original state q together with the list of elements extracted from the string that was processed to reach that state. The input alphabet of this adapted automaton contains references to elements of the strings, instead of the values of their elements. We use as a reference a pair that consists of the value of the element being processed and the index of that element. More formally, a given FSA (Σi, Σo, Q, q0, δ, φ), for which Σi = ΣM, is redefined as (Σ′i, Σ′o, Q′, q′0, δ′, φ′) where Σ′i = {(a, p) | a ∈ ΣM, p ∈ P(s)}, Σ′o = Σo × X∗, Q′ = Q × X∗, and q′0 = (q0, ε). The adapted transition function δ′ : Q′ × Σ′i → Q′ is defined as δ′((q, w), (a, p)) = (δ(q, a), w) if the value is unmarked, and δ′((q, w), (aX, p)) = (δ(q, aX), (p, X)w) when the value is marked, where q ∈ Q, w ∈ X∗, and (a, p) ∈ Σ′i. Finally, the new output function is defined as φ′((q, w)) = (φ(q), w). This modified automaton is actually not a finite state automaton, as the set of states Q′ is not finite. But this modification does not change the actual semantics of the automaton: the state part of the result of a transition still depends only on the input state and the next symbol, and the new list of extractions depends only on the current list and the next symbol. The automaton can thus be run based directly on the original automaton, given some changes in the actual implementation of the automaton execution.
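The redefined automaton can be executed directly by representing a state as a pair (q, w) of the original state and the (reversed) extraction list. The sketch below is illustrative; its transition table is a hypothetical reconstruction consistent with the extraction task of Example 4.11, not a transcription of Figure 4.13.a.

```python
# Sketch of an extracting run: state = (q, w), with w the extraction list
# kept in reverse order; p (from enumerate) plays the role of the index
# reference that the redefined input alphabet pairs with each value.
def extracting_run(delta, q0, accepting, marked_string):
    q, w = q0, ()
    for p, a in enumerate(marked_string):
        q = delta.get((q, a))
        if q is None:
            return None                          # dead: marking rejected
        if isinstance(a, tuple):                 # marked value (a, X)
            w = ((p, a[1]),) + w                 # prepend (p, X) to the list
    return (q in accepting, list(w))

# Hypothetical CCM acceptor for the task of Example 4.11: a marked '@'
# must be followed by the matching '+' or '-', an unmarked '@' by '?'.
delta = {(0, ('@', 'P')): 1, (1, '+'): 0,
         (0, ('@', 'N')): 2, (2, '-'): 0,
         (0, '@'): 3, (3, '?'): 0}
result = extracting_run(delta, 0, {0},
                        [('@', 'P'), '+', '@', '?', ('@', 'N'), '-'])
```

The run reproduces the extraction list ((4, N), (0, P)) of Example 4.11, built in backward order.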

Example 4.11 To illustrate a run of a modified marking acceptor, we introduce a string automaton with two markers: M = {P, N}. The strings in the domain are all constructed from the following three basic strings: ‘@+’, ‘@−’, and ‘@?’. The extraction task requires that an element containing ‘@’ is marked with P when it is followed by an element with the symbol ‘+’, is marked with N when it is followed by an element with the symbol ‘−’, and stays unmarked when it is followed by an element with the symbol ‘?’. A CCM acceptor for this task is presented in Figure 4.13.a.

When this acceptor is run on the string ‘@P + @?@N−’, the list of extractions consists of ((4, N), (0, P)). The run, keeping track of the extractions, is illustrated in Figure 4.13.b. The index of each element is shown on top of it in the input string.




Figure 4.13: A CCM string acceptor with markers P and N (a), and its extracting run on the marked string ‘@P + @?@N−’ (b).

For FTA’s, the situation is a bit more complex. When a tree is processed, we want to keep the extractions from that tree in the resulting tree state. In the string states, we collect those extractions that are found in the tree states of the child trees that are already processed. Therefore both string and tree states are redefined to keep a list of extractions. Formally, the transition function δT is redefined such that α′((a, p)) returns (α(a), ε) when the value is unmarked, and α′((aX, p)) returns (α(aX), (p, X)) when the value is marked. Further, δS is redefined as δ′S((q, E), (p, F)) = (δS(q, p), EF), where E, F ∈ X∗, q ∈ QS, and p ∈ QT. The output function φS is redefined as φ′S((q, E)) = (φS(q), E), and the output function of the redefined tree acceptor itself becomes φ′T((q, w)) = (φT(q), w).

Example 4.12 We illustrate the extraction process for the CCM acceptor from Figure 4.12. The run, shown in Figure 4.14, starts with processing the leaves a at index 3 and @X at index 4 (we have added indices on the nodes of the first tree so that we can refer to them). The function α′ returns the states (1, ε) with as output (A, ε), and (6, (4, X)) with as output (E, (4, X)). For the element with index 2, α′ returns the string state (3, (2, Y)). From this string state we go to state (10, (2, Y)), given the tree state (A, ε) (see dotted arrow). With input (E, (4, X)), the string state (12, (4, X)(2, Y)) is reached. This final state has as output the tree state (F, (4, X)(2, Y)). The resulting tree state for the original tree is (G, (5, Y)(4, X)(2, Y)), where G is an accepting state, and the accompanying list contains the extractions.

The memory needed during the run is no longer bounded: additional memory is needed to keep the sequence, the requested list of extractions, in memory. But since the CCM acceptor is a single marking acceptor, the size of the sequence cannot exceed the number of symbols in the document, and the number of extracted elements is often much smaller.




Figure 4.14: An illustration of an extracting run of the CCM acceptor from Figure 4.12 on a tree.

4.4.2 Single Run Extraction

We have seen how to get the list of extractions from a single marking of the document with a run on a CCM acceptor for the extraction task. To get the requested extractions we would have to run over every possible marking of the document until we encounter the one marking that results in an accepting state of the CCM acceptor. To be able to process all markings in a single run, we collapse them into a single version of the document in which each element is marked nondeterministically. Each element can either be unmarked or be marked with one of the markers from M; each element is thus an element of the nondeterministic alphabet Σ̄i = ⋃a∈Σi markM(a). By making the right choices, this version represents every possible marking of the document. Running the CCM acceptor on this nondeterministic input has a similar effect as if the α function of its transition function were nondeterministic. Hence the nondeterministic run will return the result for one of the possible markings of the document. To cope with the nondeterminism in the input, we use the same technique as described in Section 3.2.4 and Section 3.3.6 to obtain a deterministic result.

When processing a nondeterministic string, a given string acceptor A = (ΣM, Σo, Q, q0, δ, φ) behaves nondeterministically. We create a composite representation Ā for an acceptor that behaves deterministically on such input. This representation is defined as Ā = (Σ̄M, Σo, Q̄, q̄0, δ̄, φ̄), in which Q̄ = 2^Q and q̄0 = {q0}. The definition of the new transition relation becomes δ̄(q̄, ā) = q̄n with q̄n = {qn ∈ Q | δ(q, a) = qn for some q ∈ q̄ and a ∈ ā}. The output function φ̄ is




Figure 4.15: Extracting in a single run over all markings of the string ‘@+@?@−’.

defined for every q̄ ∈ Q̄ as φ̄(q̄) = accept ⇔ ∃q ∈ q̄ : φ(q) = accept.

The deterministic version of the CCM acceptor (over nondeterministic input)

will process all possible markings in parallel. When we combine this with the technique described in Section 4.4.1, the lists of extractions for every possible marking are kept in parallel (and discarded as soon as a dead state is reached for that particular marking). The final composite state that is reached for a given string can contain many different sequences of extractions. But as the underlying acceptor is a CCM acceptor, it contains maximally one sequence associated with an accepting state.

It is not practical to create this deterministic automaton, which keeps track of extractions, explicitly, as it would need to be created anew for every possible input. We will therefore execute this automaton based on the composite representation, generating transitions on the fly.
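To make the on-the-fly execution concrete, here is a minimal Python sketch; the function name, the encoding of symbols as (letter, marker) pairs, and the toy acceptor are our own assumptions, not the thesis's (Java) implementation.

```python
# Sketch: executing the composite acceptor on the fly. Composite states are
# generated per input symbol; the full subset automaton is never materialized.

def run_composite(delta, accepting, q0, nondet_input):
    """delta: dict (state, (letter, marker)) -> next state; missing = dead.
    nondet_input: one set of (letter, marker) pairs per position."""
    composite = {q0: ()}                    # constituent state -> extraction sequence
    for pos, symbols in enumerate(nondet_input):
        nxt = {}
        for q, ext in composite.items():
            for letter, marker in symbols:
                qn = delta.get((q, (letter, marker)))
                if qn is None:              # dead state: this marking is eliminated
                    continue
                # a CCM acceptor keeps at most one sequence per constituent state
                nxt[qn] = ext + ((pos, marker),) if marker else ext
        composite = nxt
    for q, ext in composite.items():
        if accepting(q):
            return list(ext)
    return None

# Toy CCM acceptor (ours): exactly the last element must carry marker X.
delta = {(0, ('@', None)): 0, (0, ('@', 'X')): 1}
result = run_composite(delta, lambda q: q == 1, 0,
                       [{('@', None), ('@', 'X')}] * 3)
assert result == [(2, 'X')]
```

The overwrite `nxt[qn] = ...` is safe only because of the CCM property proved in Section 4.4.3: a composite state never holds two different extraction sequences for the same constituent state.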

Example 4.13 We run the extracting CCM acceptor from Example 4.11 over all possible markings of the same string ‘@+@?@−’. This run is shown in Figure 4.15. The first symbol @ can be interpreted as @, @P, and @N. Therefore the transition leads to three parallel states. From these states, the states 3 and 2 lead to the dead state, given the symbol + (see Figure 4.13.a). Hence these parallel states collapse to a single state after processing the second symbol. This state contains the only legal set of extractions.

We see that after the third element is processed, the 3 alternative sub-solutions share the same tail: (0, P). Although in the picture this tail is duplicated (for clarity), in an implementation the tail will point to the same list element. This memory optimization is the reason for the backward order in which sequences are concatenated in Section 4.4.1.
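The shared-tail representation can be sketched with cons cells; the encoding below is our own illustration, not code from the thesis.

```python
# Extraction sequences as immutable cons cells (head, tail), built newest-first,
# so alternative sub-solutions can physically share one common tail.

def cons(head, tail):
    return (head, tail)

def to_list(cell):
    out = []
    while cell is not None:
        head, cell = cell
        out.append(head)
    return out[::-1]                   # stored newest-first, read oldest-first

shared = cons((0, 'P'), None)          # the common tail (0, P) from Example 4.13
alt1 = cons((2, 'P'), shared)          # alternatives extend the shared tail...
alt2 = cons((2, 'N'), shared)
alt3 = shared                          # ...or reuse it unchanged
assert alt1[1] is alt2[1]              # one physical tail, no duplication
assert to_list(alt1) == [(0, 'P'), (2, 'P')]
```

Prepending to an immutable list is O(1) and never copies the tail, which is exactly why sequences are concatenated in backward order.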

An intuitive explanation of what the algorithm does is that it postpones the decision of how an element is marked by considering the different possible markings in parallel. A marking is eliminated as soon as the automaton reaches a reject state for it. When the automaton is a CCM acceptor, only a single marking is accepted in the final state.

Figure 4.16: Extracting in a single run over all markings of a tree.

The approach for FTAs is similar. Both string and tree states in the composite representation are sets of parallel states, each associated with a list of extractions.

Example 4.14 Figure 4.16 shows the single run extraction with the CCM acceptor from Figure 4.12. The final result is the same as the result for the completely correct marking, as calculated in Figure 4.14.

4.4.3 Complexity

The algorithm for extraction in a single run, as described in the previous section, needs to run only once on a document, compared to m × n runs for the algorithm using PCM acceptors (see Section 4.2.2), with m the number of different markers in M, and n the number of elements in the document. However, the complexity of these runs differs. Therefore, a better comparison is based on the number of transitions. In the single run algorithm, a single δ̄-transition consists of multiple δ-transitions, one for each q in a composite state q̄. We start with the string case. For the algorithm using PCM acceptors, there is one δ-transition per element of the string; hence, for m × n runs, we have m × n² δ-transitions in the worst case. With some optimization this reduces to m × n × (n + 1)/2, still O(m × n²). For the single run algorithm, n δ̄-transitions are needed. The number of δ-transitions is hard to calculate in general. We prove below that the number of δ-transitions has an upper bound that is linear in n.


Figure 4.17: An automaton extracting the 2nd last element of a string. The run over a string of length n = 4 needs 9 δ-transitions.

The number of δ-transitions in the δ̄-transition that reaches a certain composite state q̄ depends on the number of simple states contained in q̄. We prove that for a CCM acceptor, every q̄ ∈ Q̄ contains maximally one occurrence of every q ∈ Q. Suppose a composite state q̄ would contain more than one occurrence of a state q, say (q, E1) and (q, E2). For every sequence that gets accepted starting from q, there would then be two markings accepted: one that has at least the elements of E1 extracted, and one that has at least the elements of E2 extracted. A CCM acceptor accepts maximally one marking per document. This entails that E1 = E2 and therefore (q, E1) = (q, E2), which contradicts the assumption that the occurrences are distinct and proves our claim. Hence, every composite state q̄ contains maximally #Q (the number of states in Q) simple states. Therefore the number of δ-transitions is smaller than or equal to #Q × n. This upper bound can be lowered slightly by taking m into account. In Example 4.15, we give an example of the worst case for m = 1. In this example, the number of δ-transitions is #Q × n − ((#Q − 1) × #Q)/2.

Example 4.15 We introduce the following extraction task: given the alphabet Σ = {@} and set of markers M = {X}, extract the n-th last element of every string. In Figure 4.17.a, a solution is given for n = 2. The given CCM acceptor only accepts those strings that have only the second last element of the string marked with X. Figure 4.17.b illustrates the run of this CCM acceptor over the string @@@@. The extracted element is (2, X), as state 3 is an accepting state.
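The run and the transition count of this example can be checked mechanically. The three-state acceptor below is our reconstruction of Figure 4.17.a from the description in the text; δ-transitions are counted as one per constituent state per step, as in the complexity argument above.

```python
# Our reconstruction of the Example 4.15 acceptor (second-last element marked X),
# executed over all markings of '@@@@' in parallel, counting basic δ-transitions.

DEAD = None

def delta(q, marked):
    # q1: no element marked yet; q2: the previous element was marked;
    # q3: accepting, reached when exactly the second-last element was marked.
    if q == 1:
        return 2 if marked else 1
    if q == 2:
        return DEAD if marked else 3
    return DEAD                         # q3 must be the last state reached

composite = {1: ()}                     # constituent state -> extraction sequence
basic = 0                               # number of basic δ-transitions
for pos in range(4):                    # the string @@@@
    nxt = {}
    for q, ext in composite.items():
        basic += 1                      # one δ-transition per constituent state
        for marked in (False, True):
            qn = delta(q, marked)
            if qn is not DEAD:
                nxt[qn] = ext + ((pos, 'X'),) if marked else ext
    composite = nxt

assert basic == 9                       # as stated in the caption of Figure 4.17
assert composite[3] == ((2, 'X'),)      # state 3 accepts: extract (2, X)
```

The composite states grow as 1, 2, 3, 3 constituent states, summing to 9 = #Q × n − ((#Q − 1) × #Q)/2 for #Q = 3 and n = 4.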

In acceptors based on k-local string inference (see Section 5.1), different markers for an element can persist for at most k steps, because the automaton is not expressive enough to use evidence for some marker that is farther away. In the worst case, when every wrong marker is eliminated only after k steps, the number of constituent states in a composite state is maximally (m + 1)^k. Therefore, in the case of k-local automata, the upper bound becomes min((m + 1)^k, #Q) × n.

In the case of trees, the result is similar (a linear upper bound). We count one δ̄T-transition for each node, and one δ̄S-transition for each node except the root node. The number of δT-transitions is smaller than #QT × n, where #QT is the number of FTA states. The number of auxiliary FSA transitions is smaller than #QS × (n − 1), where #QS is the number of FSA states.

data set   field          Single Run (sec.)   PCM (sec.)    speedup factor   mean #n
bigbook    name           3.6 ±0.1            82.3 ±0.1     22.86            505.98
bigbook    address        3.3 ±0.1            81.5 ±0.1     24.70
iaf        organization   0.25 ±0.01          3.48 ±0.01    13.92            450.70
iaf        alt. name      0.25 ±0.01          3.46 ±0.01    13.84
okra       name           3.3 ±0.1            153.3 ±0.1    46.45            388.38

Table 4.1: Timings on RISE data sets

For both strings and trees we found a linear upper bound for the number of basic transitions. The factor can be quite large. In practice, different alternative markings do not persist very long and the actual factor is much lower than the upper bound; cf. the experimental results in the next section.

4.4.4 Experiments

We chose the following data sets to evaluate the extraction algorithm:

• The Bigbook dataset: 235 webpages. We extract the ‘name’ and the ‘address’ field. Both fields occur 4299 times in the dataset.

• The IAF dataset: 10 webpages. We extract the ‘organization’ field, which occurs 94 times, and the ‘alt.name’ field, which occurs 12 times. This is data set S11 from the WIEN data sets.

• The Okra dataset: 252 webpages. We extract the ‘name’ field, 3335 occurrences. The other fields in this dataset are ‘score’, ‘date’, and ‘mail’. Together all these fields amount to 13340 extracted elements.

The experiments were carried out on a Pentium II 400 MHz processor with 128 MB of RAM. The implementation is coded in Java (version 1.4.1). In the first experiment we used the algorithm from (Kosala et al. 2003) to infer wrappers for the datasets (where needed, these wrappers were converted to CCM acceptors). The generated wrappers are all 100% accurate. With these wrappers we extracted the datasets with both the single run algorithm and with the algorithm based on a PCM acceptor, needing multiple runs. The results can be found in Table 4.1 (each experiment was run at least three times). To give an indication of the size of the pages in the tasks, the last column shows the mean, over all pages, of the number of nodes in the trees.

In a second experiment we used a set of manually crafted wrappers for the Okra dataset, mainly because the inference algorithm used cannot generate wrappers that combine multiple markers.¹ This experiment compares extraction with m = 1 (name marker) and m = 4 (all markers). We repeated the experiments for the HTML pages represented as trees and as strings (with different wrappers). For strings, n means the number of elements in the string. The results are given in Table 4.2. These results show that m has only a small impact on the run-time of our single run algorithm (increases of respectively 1.25 and 1.37 for m = 4). The number of runs of the PCM acceptor increases by 4, yielding a slow-down of respectively 4.65 and 4.35 (due to differences in the time needed for extracting the different types of fields). When we compare the first row of Table 4.2 with the last one of Table 4.1, we see that we get different timings for the same extraction task. This is due to the fact that the manually crafted wrapper is more specialized towards the trees in the dataset, while the learned one generalizes a bit more. Also, the timings for tree automata in Table 4.2 are better than those for string automata. The reason is that FTAs are better suited for structured documents and are therefore less complex than FSAs for the same extraction task.

                Single Run (sec.)      PCM (sec.)              speedup factor   mean #n
tree    name    2.8 ±0.1               138.0 ±0.1              49.29            388.38
        all     3.5 ±0.1   (× 1.25)    641.5 ±0.1   (× 4.65)   183.29
string  name    9.2 ±0.1               344.9 ±0.1              37.49            622.38
        all     12.6 ±0.1  (× 1.37)    1501.2 ±0.1  (× 4.35)   119.14

Table 4.2: Comparing single-field extraction to multiple-field extraction on the Okra set.

In a last experiment we took a set of 12 documents from the Okra dataset. We ensured that the numbers of nodes in these documents were evenly distributed. We extracted each of these documents a thousand times to get a reasonable estimate for the running time of the single run algorithm. The results drawn in Figure 4.18 show a nice linear behavior.

4.5 Summary

In this chapter we have shown how to use automata (both string and tree automata) to extract information from a document. Hereto we defined marked documents as documents with a marker on some of their elements, a marked version of a document being called a marking of that document. A connection with an extraction task is made by defining a document that is marked correctly with regard to that extraction task. We defined different classes of acceptors that accept different types of correct markings: completely correct markings (CCM),

¹And at the time of these experiments, the operations to combine wrappers were not fully implemented yet.

Figure 4.18: Experiments illustrating the linear complexity of single run extraction. (Left panel: timings on Okra as trees, milliseconds against number of nodes; right panel: timings on Okra as strings, milliseconds against number of elements; each panel plots the ‘name’ and ‘all’ tasks.)

partially correct markings (PCM), single marker correct markings (SCM), empty markings (EM), and empty or single marker correct markings (ESCM). Such acceptors can be used to represent wrappers for information extraction. Which type of acceptor is used depends on the approach chosen for extraction.

We provided several operators to convert between different types of correct marking acceptors. Besides, we presented an approach to combine wrappers for different extraction tasks, allowing multiple single fields to be extracted with the same wrapper. Based on CCM acceptors, we gave an efficient algorithm to perform extraction in a single run on the tree. This approach tries all possible markings in parallel, and is able to finish many parallel runs early on, thanks to the property of the CCM acceptor that it accepts only a single marking per document.


Chapter 5

(k, l)-Contextual Tree Languages

A formal language is a set of strings (or trees) defined by enumeration (finite sets) or by mathematical description. Examples of the latter are languages defined by a set of formal grammar rules, or the set of strings (or trees) accepted by a finite state acceptor. To learn a language from examples means that the rules of that language, or the acceptor associated with that language, are learned (grammar or automata induction), starting from a small set of training examples for which it is indicated whether they belong to the language or not.

A typical approach to learn a language from examples is to start from an acceptor accepting exactly the positive examples, and to subsequently apply generalization operators, so that more similar strings (or trees) are accepted. An example of such a generalization operation merges two non-equivalent states. This preserves all existing transitions (the resulting acceptor still accepts the strings accepted by the original acceptor), and introduces new possible paths, even loops (the new automaton defines a language that is a strict superset of the language defined by the original acceptor). Typically, only those generalization operations not resulting in an acceptor that accepts a negative example are allowed (Oncina 1992; Parekh and Honavar 2000; Cicchello and Kremer 2003). In the absence of negative examples there is no indication of when to stop generalizing; no way to know whether the learned language is still too specific or already too general. This intuition was substantiated in (Gold 1967), implying that the class of regular languages is not learnable from positive examples only. This result spurred a myriad of papers with solutions based on either using statistical information (Denis et al. 1996; Parekh and Honavar 1997; Denis 2001) or on defining a subclass of the regular languages that is learnable from positive examples only. Examples of such subclasses in the case of string languages are k-reversible languages (Angluin


1982), uniquely terminating regular languages (Mäkinen 1996), k-contextual languages (Muggleton 1990), and k-testable languages (García and Vidal 1990). As shown by (Ahonen 1996), the latter two are equivalent. We will refer to them as k-contextual languages. The subclass of k-contextual languages has the alluring property that one single example is sufficient to perform generalization. Also for the class of regular ranked tree languages, several subclasses learnable from positive examples only were proposed: k-testable tree languages (García 1993), reversible tree languages (López et al. 2004), . . . For a more elaborate survey we refer to (Kosala 2003).

In this chapter we propose a new learnable subclass of the class of regular (unranked) tree languages, inspired by the class of k-contextual string languages. We therefore start with a brushup on k-contextual string languages. We then define and study the class of (k, l)-contextual tree languages, and continue with the learning algorithm. Finally, we discuss how a tree automaton can be generated that accepts a specific (k, l)-contextual tree language.

5.1 k-Contextual String Languages

We start with an intuitive introduction to contextual string languages based on a simplified representation. Later on we provide the definition of k-contextual string languages, and we show how this relates to the simplified representation. We end with a discussion of some possible variations that alter the expressiveness/learnability.

The basic assumption behind k-contextual string languages is that strings from the same language are constructed in the same way, more specifically, with the same building blocks. The building blocks of a string are the symbols that it is made up of. However, using symbols as building blocks leads to overgeneralization, as the language will contain all strings based on the set of encountered symbols (Σ∗, if the example strings range over the whole alphabet). The generalization is therefore constrained by restricting the minimal granularity of the building blocks. Given a finite alphabet, the number of possible building blocks of a given maximal size is finite. This implies that such a language is learnable from positive examples only, seeing that as soon as an example of each of the building blocks is encountered, the language is learned.

As building blocks of a string, we discern the substrings of that string of a given length k. To learn, the building blocks from each of the positive examples are collected into a representative set. To check whether some string belongs to the learned language, it is sufficient to check whether its building blocks are a subset of the representative set. For completeness' sake: when a string itself is smaller than k, the set of its building blocks is the singleton containing that string.


Example 5.1 Starting from the string ‘abcbabcba’, we learn the associated languages for k = 2 and k = 3 by collecting respectively the 2- and 3-substrings. We call these languages respectively L2 and L3. The resulting representative sets for these languages are {ab, bc, cb, ba} and {abc, bcb, cba, bab}. The string ‘ababab’, having as 2-substrings {ab, ba}, is clearly an element of L2, while it is not an element of L3, as it has a 3-substring aba, which is not in the representative set of L3.
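This learning scheme is only a few lines of code; the sketch below (our own naming) reproduces Example 5.1.

```python
# Learning from k-substrings: collect building blocks from positive examples,
# then test membership by subset inclusion.

def k_substrings(s, k):
    if len(s) < k:
        return {s}                      # a short string is its own building block
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def learn(examples, k):
    rep = set()
    for s in examples:
        rep |= k_substrings(s, k)
    return rep

def accepts(rep, s, k):
    return k_substrings(s, k) <= rep

L2 = learn(['abcbabcba'], 2)
L3 = learn(['abcbabcba'], 3)
assert L2 == {'ab', 'bc', 'cb', 'ba'}
assert L3 == {'abc', 'bcb', 'cba', 'bab'}
assert accepts(L2, 'ababab', 2)         # in L2, as in Example 5.1
assert not accepts(L3, 'ababab', 3)     # 'aba' is not a building block of L3
```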

5.1.1 Definitions

In (Muggleton 1990), the following definition is given for k-contextual languages:

Definition 5.1 (k-Contextual Language) A regular language L ⊂ Σ∗ is k-contextual if and only if ∀u1, u2, w1, w2, v ∈ Σ∗: u1vw1 ∈ L and u2vw2 ∈ L and length(v) = k ⇒ u1v \ L = u2v \ L.

An acceptor A is said to be k-contextual if and only if L(A) is k-contextual. We can rewrite Definition 5.1 into a definition for a k-contextual acceptor as follows. Since w1 and w2 are not used in the conclusion of the implication, we can write it as: ∀u1, u2, v ∈ Σ∗: (∃w1, w2 ∈ Σ∗ : u1vw1 ∈ L(A) and u2vw2 ∈ L(A)) and length(v) = k ⇒ u1v \ L(A) = u2v \ L(A). Given Definition 3.17, uvw ∈ L(A) is equivalent with φ(δ(q0, uvw)) = accept, or φ(δ(δ(q0, uv), w)) = accept. According to the definition of a dead state (Definition 3.18), ∃w ∈ Σ∗ : φ(δ(δ(q0, uv), w)) = accept is equivalent to: δ(q0, uv) is no dead state. Combining Propositions 3.1 and 3.2 shows that u1v \ L(A) = u2v \ L(A) is equivalent to δ(q0, u1v) ≡A δ(q0, u2v), or to δ(q1, v) ≡A δ(q2, v) for q1 = δ(q0, u1) and q2 = δ(q0, u2). This leads to the following definition for k-contextual acceptors:

Definition 5.2 (k-Contextual Acceptor) Given a regular acceptor A defined by (Σi, Σo, Q, q0, δ, φ), A is k-contextual if and only if ∀q1, q2 ∈ Q, ∀v ∈ Σ∗: δ(q1, v) and δ(q2, v) are no dead states and length(v) = k ⇒ δ(q1, v) ≡A δ(q2, v).

Hence, when we take two arbitrary states from a k-contextual acceptor and, starting from these states, process the same input, then we know that after k steps the resulting states will be equivalent (unless one or both of them result in a dead state). This implies that the state of a string is determined locally: by the final k input symbols, or by some earlier k symbols when they imply a rejection.

In (Ahonen 1996), the k-grams of a string or a set of strings are defined as the substrings of length k found in these strings after a preprocessing step is performed on them. This preprocessing prepends k − 1 ‘#’ symbols before each string and appends a trailing ‘#’ symbol. Formally:

Definition 5.3 (k-grams) Let S be a set of strings over Σ, where # ∉ Σ. The set of k-grams of S is defined as

k-grams(S) = {u | u is a substring of #^(k−1) s #, length(u) = k, s ∈ S}.


As with k-substrings (shown in the introduction of this section), a set of k-grams can be used as a representative set to define a language. In both cases the set of allowed building blocks is given. The difference is that with k-grams it is possible to restrict the initial and final part of the strings of the language, because the k-grams containing a ‘#’ indicate the building blocks used for the extremities. This is illustrated in the following example.

Example 5.2 The language learned from the string ‘abcbabcba’ in Example 5.1 (based on 2-substrings) accepts the string ‘ababab’, while the language learned from ‘abcbabcba’ based on 2-grams rejects the string ‘ababab’, as the set of its building blocks {#a, ab, ba, b#} is not a subset of the representative set {#a, ab, bc, cb, ba, a#}. One can say that the induction of languages based on k-grams generalizes less than that based on k-substrings, as the learned language is a subset of the latter. But on the other hand these languages are more expressive.
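Definition 5.3 and this example can be checked mechanically; the helper below is our own transcription of the padding scheme.

```python
# k-grams per Definition 5.3: pad with k-1 leading '#'s and one trailing '#',
# then take all k-substrings of the padded string.

def k_grams(strings, k):
    grams = set()
    for s in strings:
        padded = '#' * (k - 1) + s + '#'
        grams |= {padded[i:i + k] for i in range(len(padded) - k + 1)}
    return grams

rep = k_grams(['abcbabcba'], 2)
assert rep == {'#a', 'ab', 'bc', 'cb', 'ba', 'a#'}
blocks = k_grams(['ababab'], 2)         # {'#a', 'ab', 'ba', 'b#'}
assert not blocks <= rep                # 'b#' is missing: rejected, as in Example 5.2
```

The grams containing ‘#’ are exactly the ones that constrain the extremities: ‘b#’ says the string may end in b, which the training string never does.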

The extra expressiveness provided by the preprocessing step (adding #’s) results in languages equivalent to the family of k-contextual languages, as stated in (Ahonen 1996), where it is proven that for every k-contextual language, a set of (k+1)-grams exists that defines exactly the same language, and vice versa. This implies that the class of k-testable languages in the strict sense, as defined in (García and Vidal 1990), is also equivalent to the class of k-contextual languages. This definition is based on a set of k-substrings that are not allowed, and sets of prefixes (smaller than k) and suffixes (smaller than k) that are allowed. Given that the number of possible k-substrings is finite, the former part of the definition is equivalent to indicating the set of allowed substrings. For example, the language (k = 2) from Example 5.1 can also be defined by the set of forbidden substrings {aa, ac, bb, ca, cc}. And the latter part of the definition is an explicit formulation of the restrictions on the initial and final parts, imposed by using the ‘#’ symbol.

5.1.2 Generalization Power versus Expressiveness

When choosing a subclass of languages that is learnable from positive examples only, there is always a trade-off between the generalization power and the expressiveness of these languages. It is nice to be able to learn a language from as few examples as possible, but it is also important that the target language can be expressed. It is obvious that this choice is highly influenced by the application at hand. In this section we run through some variations on the definition of k-contextual languages, and we discuss their influence on the expressiveness.

In contrast to the notion of k-testable string languages “in the strict sense” as defined in (García and Vidal 1990), a more expressive notion of k-testable languages is studied in (McNaughton 1974). Instead of a single set of k-grams, a language is defined by multiple sets of k-grams, such that a string is accepted if and only if the set of its k-grams is a subset of at least one of those sets. During learning, the k-grams of the examples are not collected in a single set, but


each example results in a separate set of k-grams. The learned language is then defined by the collection of all sets from the examples. The k-testable languages in the strict sense are not expressive enough to allow strings made from building blocks A and B, or from building blocks B and C, while at the same time rejecting strings made from building blocks A and C. This is no problem for the more general notion, as illustrated in the example below. The advantage of k-testable languages in the strict sense is that they need fewer examples to learn, as not every combination of building blocks has to be present in a separate example.

Example 5.3 For k = 2, the examples ‘abcbcbcbd’ and ‘dbcbcbcba’ result in a collection of two sets of substrings of length 2 (this example can be extended to 2-grams): {{ab, bc, cb, bd}, {db, bc, cb, ba}}. The language defined by this representative set is expressive enough to reject the strings ‘ababa’ and ‘bdbdb’, unlike the 2-testable language in the strict sense learned from the same two examples. But on the other hand, not enough examples are presented to generalize enough such that strings like ‘abcbcbcba’ and ‘dbcbcbd’ are accepted (unlike with languages in the strict sense).

As shown in Example 5.2, the expressiveness can be increased (at the expense of generalization power) by placing extra restrictions on the building blocks. In the case of k-contextual languages, these are restrictions on initial and final substrings. A compromise between the expressiveness of k-contextual languages and languages based on k-substrings (without restrictions) restricts only the final substrings. While our interest does not lie in string applications that benefit from this, we mention it here because we will encounter a similar approach in the tree case. For this single sided restriction we propose a third option (next to an explicit set of allowed final substrings, or adding k − 1 ‘#’ symbols after the string). We define the building blocks of a string as its k-substrings together with its suffixes smaller than k. The building blocks smaller than k are parts from the tail of the string, and have to match with the smaller building blocks in the representative set, which restricts the possible suffixes. This is illustrated in the example below.

Example 5.4 This new definition of building blocks results in the following set for the string ‘abcbabcba’ (and k = 3): {abc, bcb, cba, bab, ba, a}. The resulting language accepts amongst others the strings ‘cbabcba’, ‘bcba’, and ‘bcbabcbabcba’, but rejects the string ‘abcbab’, because its building blocks ab and b are not contained in the representative set.
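The suffix-restricted variant can be sketched as follows (our own code, mirroring the definition just given).

```python
# Building blocks = k-substrings plus all suffixes shorter than k.
# The short suffixes constrain how a string may end, without touching its start.

def blocks(s, k):
    subs = ({s[i:i + k] for i in range(len(s) - k + 1)}
            if len(s) >= k else {s})
    suffixes = {s[-i:] for i in range(1, min(k, len(s) + 1))}
    return subs | suffixes

rep = blocks('abcbabcba', 3)
assert rep == {'abc', 'bcb', 'cba', 'bab', 'ba', 'a'}
assert blocks('cbabcba', 3) <= rep       # accepted, as in Example 5.4
assert blocks('bcba', 3) <= rep          # accepted
assert not blocks('abcbab', 3) <= rep    # suffixes 'ab' and 'b' are not in the set
```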

5.2 (k, l)-Contextual Tree Languages

To define a subclass of the regular tree languages, we start with defining a type of building blocks for regular trees. Based on these building blocks the new subclass of (k, l)-contextual tree languages is defined, similarly to the k-contextual languages.


The set of (k, l)-roots of a tree f(t1 . . . tn) is the singleton {f} if l = 1; otherwise, it is the set of trees obtained by extending the root f with the (k, l − 1)-roots of k successive children of t (all children if k > n). Formally, we have the following inductive definition. For sets S1, . . . , Sn of trees, we use the notation f(S1 . . . Sn) for the set of trees {f(s1 . . . sn) | si ∈ Si}.

R(k,l)(f(t1 . . . tn)) =
    {f}                                                      if l = 1
    f(R(k,l−1)(t1) . . . R(k,l−1)(tn))                        if l > 1 and k > n
    ⋃_{p=1..n−k+1} f(R(k,l−1)(tp) . . . R(k,l−1)(tp+k−1))     otherwise.

As an extension, the (k, l)-roots of a set T of trees are defined as R(k,l)(T) = ⋃_{t∈T} R(k,l)(t). Finally, a (k, l)-fork of a tree t is a (k, l)-root of any subtree of t. Thus, the set of (k, l)-forks of t, denoted by F(k,l)(t), is the collection of the (k, l)-roots of the subtrees of t, in each position of t: F(k,l)(t) = ⋃_{p∈P(t)} R(k,l)(t/p). The (k, l)-forks of a set of trees T are then defined as F(k,l)(T) = ⋃_{t∈T} F(k,l)(t).

Given these definitions we can state that R(k,l)(T) and F(k,l)(T) are monotone in T, or formally:

Proposition 5.1 T ⊇ T′ implies R(k,l)(T) ⊇ R(k,l)(T′) and F(k,l)(T) ⊇ F(k,l)(T′).

Example 5.5 We graphically show an example tree t in Figure 5.1. Next to it, its (2, 3)-forks can be found. The first 6 of these forks are the (2, 3)-roots of t. The other forks are the (2, 3)-roots of the subtrees of t. Note that the set of (2, 3)-roots of a leaf of a tree is the singleton containing that leaf itself.


Figure 5.1: The (2,3)-forks of a tree t.
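The definitions of (k, l)-roots and (k, l)-forks translate directly into code. In the sketch below, the tuple encoding of trees and our reading of the Figure 5.1 tree as a(b(e, f(k), g), c(h, i, j), d) are assumptions of ours, not part of the thesis.

```python
from itertools import product

def roots(t, k, l):
    """(k,l)-roots of tree t; a tree is a (label, children-tuple) pair."""
    label, children = t
    if l == 1 or not children:          # l=1, or a leaf: the label alone
        return {(label, ())}
    n = len(children)
    windows = ([children] if k > n      # all children if k > n,
               else [children[p:p + k] for p in range(n - k + 1)])
    out = set()
    for w in windows:                   # extend the root with (k,l-1)-roots
        for combo in product(*(roots(c, k, l - 1) for c in w)):
            out.add((label, combo))
    return out

def forks(t, k, l):
    """All (k,l)-roots of all subtrees of t."""
    result = roots(t, k, l)
    for child in t[1]:
        result |= forks(child, k, l)
    return result

def leaf(x):
    return (x, ())

# Our reading of the tree in Figure 5.1: a(b(e, f(k), g), c(h, i, j), d).
t = ('a', (('b', (leaf('e'), ('f', (leaf('k'),)), leaf('g'))),
           ('c', (leaf('h'), leaf('i'), leaf('j'))),
           leaf('d')))
assert len(roots(t, 2, 3)) == 6         # the six (2,3)-roots of t
assert len(forks(t, 2, 3)) == 18        # all (2,3)-forks shown in Figure 5.1
```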

Definition 5.4 The (k, l)-contextual tree language based on the set G of trees is defined as L(k,l)(G) = {t ∈ T(Σ) | F(k,l)(t) ⊆ G}.

Every given (k, l)-contextual language can be defined by an infinite number of different sets. For example, adding elements to or removing elements from G with either height > l or width > k will not influence the definition of L(k,l)(G):


Proposition 5.2 L(k,l)(G) = L(k,l)(G ∩ T(k,l)), with T(k,l) the set of trees with height at most l and width at most k.

Note that T(k,l) is finite, so we can always assume that G is finite. But even addition or removal of trees with height ≤ l and width ≤ k will not always influence the definition of L(k,l)(G). The following proposition shows that there is always a unique smallest G:

Proposition 5.3 If L is (k, l)-contextual, then NL := F(k,l)(L) is the smallest set G (with respect to set inclusion) such that L = L(k,l)(G). We call NL the representative set for L.

Proof First, we show that L = L(k,l)(NL). The inclusion from left to right is trivial. For the converse inclusion, we know that L = L(k,l)(G) for some G (since L is given to be (k, l)-contextual). Clearly, NL ⊆ G for any such G. Hence, if F(k,l)(t) ⊆ NL for some tree t, then also F(k,l)(t) ⊆ G and thus t ∈ L.

Since we observed that NL ⊆ G for any G such that L = L(k,l)(G), the minimality of NL is established as well and the proposition is proved. □

As an immediate corollary we obtain:

Corollary 5.1 For any two (k, l)-contextual languages L1 and L2, we have L1 ⊆ L2 if and only if NL1 ⊆ NL2.

Proof The implication from left to right is trivial. For the other direction, we have L1 = L(k,l)(NL1) ⊆ L(k,l)(NL2) = L2. □

Example 5.6 To illustrate these definitions we show an example with (k, l) = (2, 2). We write trees in linear notation, with the children of a node in parentheses behind its label. Given a set of trees

G = {b, a(b c), a(d b), c(b b), a(b(c) a(b))},

the associated language is

L(2,2)(G) = {b, c(b b), c(b b b), c(b b b b), . . . , a(b c(b b)), a(b c(b b b)), a(b c(b b b b)), . . .}.

And the representative set for this language is N = {b, a(b c), c(b b)}. The tree t = a(d b c) ∉ L(2,2)(G) because F(2,2)(t) = {b, c, d, a(b c), a(d b)} ⊈ G. Note that N ⊈ L(2,2)(G).

This definition is a generalization of k-testable languages in the strict sense, because we use only a single set of (k, l)-forks. Experimental results have shown that for the particular application of information extraction the expressiveness of (k, l)-contextual tree languages "in the strict sense" is sufficient, leaving the freedom to opt for more generalization power.

Another side remark is that the definition of (k, l)-contextual tree languages does not place explicit extra restrictions on the occurrence of forks. The definition of (k, l)-forks does actually introduce an implicit restriction on the 'leaf' forks, similar to the single-sided restriction on strings discussed in Section 5.1.2. The bottom elements of the tree will be split up into building blocks with smaller depth than l, which discriminates the use of certain forks as bottom forks.

Example 5.7 The tree t = a(b c) is not an element of the language L(2,2)(G) defined in Example 5.6. The set of (2, 2)-forks of this tree is F(2,2)(t) = {b, c, a(b c)}. As c is a leaf of t, it is included in F(2,2)(t). Because c does not occur in G, the fork a(b c) is prevented from occurring as a bottom fork.

The current definition of (k, l)-contextual tree languages is sufficiently expressive for our needs. In case more expressiveness is needed, one possibility is to restrict the top, bottom, left, and right forks. Explicit sets of allowed forks are cumbersome, because the sets of allowed left and right forks have to be checked for every subtree. A more straightforward approach uses a preprocessing step on the trees before the regular use of the (k, l)-forks (similar to the preprocessing step in the definition of k-grams). This transformation HV of a tree is defined as HV(t) = V(H(t)), where H and V are the horizontal and the vertical transformation. They are defined respectively as

H(f) = f
H(f(t1 . . . tn)) = f(# H(t1) . . . H(tn) #)

and

V(t) = #(V′(t))
V′(f) = f(#)
V′(f(t1 . . . tn)) = f(V′(t1) . . . V′(tn))
V′(#) = #

with f ∈ Σ, # ∉ Σ and t1 . . . tn ∈ T(Σ ∪ {#}). The altered tree language is then defined as L(k,l)(G) = {t ∈ T(Σ) | F(k,l)(HV(t)) ⊆ G}.
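The two transformations can be sketched directly on a nested-tuple tree encoding (our own encoding, not from the thesis), with # modelled as an ordinary fresh label:

```python
HASH = '#'  # the fresh padding symbol, assumed not to occur in the alphabet

def H(t):
    """Horizontal transformation: pad every child list with # on both sides."""
    label, children = t
    if not children:
        return (label, ())
    return (label, ((HASH, ()),) + tuple(H(c) for c in children) + ((HASH, ()),))

def V(t):
    """Vertical transformation: a # above the root, a # below every leaf."""
    return (HASH, (_v(t),))

def _v(t):
    label, children = t
    if label == HASH:
        return (HASH, ())
    if not children:
        return (label, ((HASH, ()),))
    return (label, tuple(_v(c) for c in children))

def HV(t):
    """The combined transformation HV(t) = V(H(t))."""
    return V(H(t))
```

Applying these to t = a(b c) reproduces the trees of Example 5.8.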


Example 5.8 Given a tree t = a(b c), the transformation H(t) results in t′ = a(# b c #), V(t) yields #(a(b(#) c(#))), and V(H(t)) = V(t′) gives #(a(# b(#) c(#) #)).

5.3 Learning (k, l)-Contextual Tree Languages

To learn a (k, l)-contextual tree language from a set of positive examples E, we collect the (k, l)-forks of these examples and use them as the representative set for the language to be learned. In other words, we assume the language to be learned equals L(k,l)(F(k,l)(E)) (see Algorithm 5.1). Note that the representative set for this language equals F(k,l)(E). This way, overgeneralization is avoided as, for a given k and l, the algorithm finds the most specific (k, l)-contextual language that accepts all the examples.

Algorithm 5.1 learnWrapper
Input: The set of positive examples E, and the parameters k and l.
Output: The learned (k, l)-contextual language.
1: Forks = ∅
2: for each Example ∈ E do
3:   Forks = Forks ∪ F(k,l)(Example)
4: end for
5: return L(k,l)(Forks)
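Algorithm 5.1 amounts to taking the union of the fork sets of all examples and using it as the representative set. A self-contained Python sketch (trees as (label, children) tuples; the fork computation is our reconstruction of F(k,l), not the thesis implementation):

```python
import itertools

def roots(t, k, l):
    """All (k, l)-roots of tree t."""
    label, children = t
    if l == 1 or not children:
        return {(label, ())}
    windows = [children] if len(children) < k else \
              [tuple(children[i:i + k]) for i in range(len(children) - k + 1)]
    return {(label, combo)
            for w in windows
            for combo in itertools.product(*[roots(c, k, l - 1) for c in w])}

def forks(t, k, l):
    """F(k,l)(t): the (k, l)-roots of every subtree of t."""
    out = set(roots(t, k, l))
    for c in t[1]:
        out |= forks(c, k, l)
    return out

def learn_wrapper(examples, k, l):
    """Algorithm 5.1: representative set of the most specific
    (k, l)-contextual language accepting all the examples."""
    G = set()
    for example in examples:
        G |= forks(example, k, l)
    return G

def accepts(G, t, k, l):
    """Membership test for the learned language L(k,l)(G)."""
    return forks(t, k, l) <= G
```

Every example is accepted by construction; by the anti-monotonicity noted below, growing k or l can only shrink the learned language.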

We note that this learning method is anti-monotonic in the parameters k and l:

Proposition 5.4 If k′ ≥ k and l′ ≥ l then L(k′,l′)(F(k′,l′)(E)) ⊆ L(k,l)(F(k,l)(E)).

Proof Let t be a tree such that F(k′,l′)(t) ⊆ F(k′,l′)(E). We must show that F(k,l)(t) ⊆ F(k,l)(E). Consider a (k, l)-fork r of t. Then r can be extended to a (k′, l′)-fork r′ of t (it is possible that r′ equals r). By the given, r′ also appears as a (k′, l′)-fork of some tree t′ in E. But then r appears as a (k, l)-fork of t′ as well, and thus r ∈ F(k,l)(E), as had to be shown. □

This anti-monotonicity is typical for local languages such as k-contextual languages (Muggleton 1990; Ahonen 1996), k-testable languages (García and Vidal 1990), and k-testable tree languages (García 1993; Knuutila 1993).

In the remainder of this section we formally prove that the class of (k, l)-contextual tree languages is learnable in the limit (Gold 1967) from positive examples only. This proof is structured in the same way as similar proofs in (Angluin 1982; Muggleton 1990). Hence we start by proving that there exists a characteristic sample for every language.

Definition 5.5 A characteristic sample of a (k, l)-contextual tree language L is a finite subset S of L, such that L is the smallest (k, l)-contextual tree language, given k and l, that contains S.

Proposition 5.5 Every (k, l)-contextual tree language has a characteristic sample.

Proof Call two trees t1 and t2 "equivalent" if F(k,l)(t1) = F(k,l)(t2). Since there are only finitely many different trees of width k and height l, there are also only finitely many different sets of such trees, and as a consequence, there are only finitely many different equivalence classes. Moreover, any (k, l)-contextual language L is closed under equivalence, i.e., can be written as a (finite) union of equivalence classes. We now claim that it suffices to pick a representative from each class in L to obtain a characteristic sample S for L.

To prove this claim, consider any other (k, l)-contextual language L′ such that L′ ⊇ S. We show L ⊆ L′. Thereto, let t ∈ L. Then t is equivalent to some representative t′ from S. Since S is contained in L′ and L′ must be closed under equivalence, also t ∈ L′, as desired. □

A positive presentation of a language L is an infinite sequence of trees T = t1, t2, t3, . . ., such that every element of the sequence is an element of L and vice versa. We define an inference operator KL, which given an infinite sequence of trees t1, t2, t3, . . . and parameters k and l, produces an infinite sequence of tree languages L1, L2, L3, . . . in which Ln = L(k,l)(F(k,l)({t1, t2, . . . , tn})) for all n ≥ 1.

Observe, by Proposition 5.3, that Ln ⊆ L for each n, if L is (k, l)-contextual. The following proposition now shows that (k, l)-contextual languages are indeed identifiable in the limit:

Proposition 5.6 If L is (k, l)-contextual, then Ln = L for n sufficiently large.

Proof Let n be sufficiently large such that {t1, . . . , tn} includes a characteristic sample S of L. Then Ln is a (k, l)-contextual language containing S and thus Ln ⊇ L. Since also Ln ⊆ L, we conclude Ln = L. □

Example 5.9 The language L(2,2)(G) in Example 5.6 is divided into 3 equivalence classes. Taking a representative of each class results in the following characteristic sample: {b, c(b b), a(b c(b b))}. We can reduce this to {a(b c(b b))}, as the forks of the other representatives are all among the forks of this single tree.


Hence the class of (k, l)-contextual tree languages is learnable from positive examples only. Choosing larger parameters allows for more expressive languages, at the cost of bigger representative sets and bigger characteristic samples for these languages. For learning we want to choose the parameters small, to minimize the number of examples needed, but large enough such that the resulting language is expressive enough for the problem at hand.

5.4 Learning (k, l)-Contextual Tree Acceptors

One can query whether a tree is a member of a given (k, l)-contextual tree language by collecting the (k, l)-forks of that tree and checking whether the resulting set is a subset of the representative set of that language. Much more efficient is to use a tree acceptor accepting exactly that language to perform this query. Such an acceptor is called a (k, l)-contextual tree acceptor:

Definition 5.6 ((k, l)-Contextual Tree Acceptor) A regular tree acceptor T is (k, l)-contextual if and only if L(T) is (k, l)-contextual.

In this section we will show how to learn such an acceptor directly from a set of example trees, so that the time needed for learning is also improved. This direct approach will at the same time reduce the memory needs of the learning phase. The learning of the (k, l)-contextual tree acceptor is split into two steps. First, an acceptor is created (starting from the tree examples) that accepts the set of (k, l)-forks of the examples. Second, a conversion algorithm is run that converts this fork set acceptor into a (k, l)-contextual tree acceptor that accepts the language defined by that set of forks. We start by introducing a small example of a (k, l)-contextual tree language that we will use further on to illustrate the algorithms in each of these steps.

Figure 5.2: A tree t with on the left the representative set of (3, 3)-forks of that tree, and on the right two trees that belong to the language learned from t. [The tree diagrams of this figure are not reproduced in this text-only version.]


Example 5.10 Figure 5.2 shows the (3, 3)-forks of a tree t. The first 3 of these forks are the (3, 3)-roots of t. The trees on the right belong to the language L(3,3)(F(3,3)({t})). The set of (3, 3)-forks of the left tree consists of the four last forks on the second row. The set of (3, 3)-forks of the right tree consists of the second fork on the first row, and the first and the three last forks on the second row. These sets are clearly subsets of the (3, 3)-forks of t. These two trees also illustrate that, with a small set of forks of limited size, it is possible to build trees of arbitrary width or height.

5.4.1 Fork Set Acceptor

We show first how to construct a fork set acceptor from a set of forks, based on a general, incremental construction algorithm. Later on we introduce an approach that can learn the fork set acceptor for the (k, l)-forks of a given tree directly from that tree.

5.4.1.1 Incremental Construction

We define an incremental construction framework for tree acceptors. Initially this framework starts from a tree acceptor that rejects everything. During construction, a tree can be processed by that acceptor. If the tree is accepted, the acceptor stays unchanged; otherwise the acceptor is adapted such that it accepts the same trees as before, plus that extra tree.

The initial acceptor is T0 = (Σi, Σo, QT, δT, φT), with input alphabet Σi = ∅, output alphabet Σo = {accept, reject}, and set of tree states QT = φT = ∅. The automaton AT = (QT, QT, QS, α, δS, φS) with QS = α = δS = φS = ∅ is an FSA used to represent δT. We define alternative functions α′, δ′S, and φ′S for respectively α, δS, and φS, that are to be used during the construction phase. These functions alter α, δS, and φS when needed to accept a newly given tree. After construction the final α, δS, and φS are used. These functions are defined as follows:

• function α′ is defined such that α′(a) = α(a) for every a ∈ Σi. For every a ∉ Σi, the following side effects are performed: a is added to Σi, a new string state s is added to QS, and α is extended with α(a) = s; finally the state s is returned as function value.

• function δ′S is defined such that δ′S(s, t) = δS(s, t), when the latter is defined. Otherwise a new string state q is added to QS, δS is extended with δS(s, t) = q, and q is returned as function value.

• function φ′S is defined such that φ′S(s) = φS(s) when defined; otherwise a new tree state t is added to QT, φT is extended with φT(t) = reject, φS is extended with φS(s) = t, and t is returned as function value.


Figure 5.3: The fork set acceptor built from the (3, 3)-forks of tree t from Example 5.10.

To construct the fork set acceptor for a given set of forks, an initial acceptor is created. Then, for every fork in the example trees of our language, we call δ′T (this is the alternative transition function defined by α′, δ′S, and φ′S) and change the output of the final tree state of the fork from reject to accept. As forks from the same example tree have parts in common, some redundant work is done.
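The incremental construction can be sketched with dictionary-backed automata. The class and method names, the integer states, and the use of Python's None and dicts are our own; the primed methods extend α, δS, and φS on demand exactly as described above:

```python
class ForkSetAcceptor:
    """Incrementally constructed acceptor for a finite set of forks
    (a sketch; trees are (label, children) tuples, states are integers)."""

    def __init__(self):
        self.alpha = {}   # alpha  : input symbol -> initial string state
        self.delta = {}   # deltaS : (string state, tree state) -> string state
        self.phi_s = {}   # phiS   : string state -> tree state
        self.phi_t = {}   # phiT   : tree state -> "accept" | "reject"
        self._count = 0

    def _fresh(self):
        self._count += 1
        return self._count

    # --- the primed functions: extend the automaton when undefined ---
    def alpha_p(self, a):
        if a not in self.alpha:
            self.alpha[a] = self._fresh()
        return self.alpha[a]

    def delta_p(self, s, t):
        if (s, t) not in self.delta:
            self.delta[(s, t)] = self._fresh()
        return self.delta[(s, t)]

    def phi_s_p(self, s):
        if s not in self.phi_s:
            q = self._fresh()
            self.phi_t[q] = "reject"   # new tree states start out rejecting
            self.phi_s[s] = q
        return self.phi_s[s]

    def add_fork(self, fork):
        """Run a fork through the primed functions, accept its tree state."""
        self.phi_t[self._build(fork)] = "accept"

    def _build(self, t):
        label, children = t
        s = self.alpha_p(label)
        for c in children:
            s = self.delta_p(s, self._build(c))
        return self.phi_s_p(s)

    # --- after construction: read-only run without the primed functions ---
    def _tree_state(self, t):
        label, children = t
        s = self.alpha.get(label)
        for c in children:
            q = None if s is None else self._tree_state(c)
            s = None if q is None else self.delta.get((s, q))
        return None if s is None else self.phi_s.get(s)

    def accepts(self, t):
        q = self._tree_state(t)
        return q is not None and self.phi_t.get(q) == "accept"
```

A missing transition during the read-only run plays the role of a dead state, so any tree containing an unseen fork part is rejected.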

5.4.1.2 Constructing Directly from a Tree

We now want to learn the fork set acceptor directly from the tree, and we want to process each node only once, even when it belongs to several forks at once. We will therefore process several forks in parallel. The method GetForks is a recursive function that is passed a tree (we call the root of that tree the current node) and does two things: it adapts the acceptor to accept all forks in that tree, and it returns a result that is used in a previous call of GetForks (on a tree that has the current tree as subtree) to adapt the acceptor to accept the forks higher up. This result is an array of l sets of output states. For i < l, the ith element contains the tree states associated with all the (k, i)-roots of the input tree. These roots are parts of the (k, l)-forks of the (l − i)th ancestor of the current node. The lth element contains the tree states associated with all the (k, l)-forks of the input tree. The first call to GetForks will return in this element the output states for all (k, l)-forks of the original tree. The output of these states is set to accept.

In GetForks, the results for each of the children¹ of the tree are calculated (recursively) only once. These results are then reused in the different combinations in which they form the (k, p)-roots of a node (for 2 ≤ p ≤ l). The states of the (k, p)-roots of a child of the current node are used to find the states of the (k, p+1)-roots of the current node. If such a state does not exist yet, it is added, and this way the acceptor is incrementally constructed. This method is shown in pseudo code in Algorithm 5.2. The automaton constructed by applying this algorithm on the tree t from Example 5.10 is shown in Figure 5.3.

¹The function map, used in Algorithm 5.2, is the same function as used in Definition 3.28.


Algorithm 5.2 GetForks
Input: A tree t = f(w), the values k and l, and a reference to an automaton T = (Σi, Σo, QT, δ′T, φT) that gets updated as a side effect, to form the end result: an automaton that also accepts the (k, l)-forks of tree t.
Output: An array result of l elements; result[i] (i < l) is the set of states from the (k, i)-roots of the input tree; result[l] is the set of states from the (k, l)-forks. This is an auxiliary result used in the recursive call.
1: len = length(w)
2: if len == 0 then
3:   for i = 1 to l do
4:     result[i] = {φ′S(α′(f))} // each (k, i)-root equals the (k, 1)-root
5:   end for
6: else
7:   for i = 2 to l − 1 do
8:     result[i] = ∅ // initialization
9:   end for
10:  result[1] = {φ′S(α′(f))} // the (k, 1)-root of t is the only (k, 1)-fork
11:  define RecFunc(x) as GetForks(x, k, l)
12:  childstates = map(RecFunc, w)
13:  result[l] = ∪c∈childstates c[l] // the (k, l)-forks from the subtrees of t
14:  if len < k then
15:    sequences = {childstates}
16:    n = len
17:  else
18:    sequences = sublist(k, childstates) // returns a set with as elements all sublists of k successive elements
19:    n = k
20:  end if
21:  for all (r1, . . . , rn) ∈ sequences do
22:    for i = l downto 2 do
23:      for each tuple tup = (t1, . . . , tn) such that ∀j : tj ∈ rj[i − 1] do
24:        rootstate = φ′S(δ′S(α′(f), tup))
25:        result[i] = result[i] ∪ {rootstate}
26:        if i == l then
27:          set φT(rootstate) = accept
28:        end if
29:      end for
30:    end for
31:  end for
32: end if
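The sublist helper used on line 18 of Algorithm 5.2 is an ordinary sliding window; a one-line Python sketch (name and signature taken from the pseudo code):

```python
def sublist(k, xs):
    """All sublists of k successive elements of xs; assumes len(xs) >= k."""
    return [xs[i:i + k] for i in range(len(xs) - k + 1)]
```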


5.4.2 Conversion to (k, l)-Contextual Tree Acceptor

To check whether a tree belongs to a (k, l)-contextual tree language, we collect all its forks and check whether they all belong to the representative set (a fork set), or we check whether they are all accepted by the fork set acceptor associated with the representative set.

In horizontal direction, a fork can overlap with k − 1 children of an adjacent fork, while in vertical direction a fork starting at a given node overlaps with the (k, l − 1)-root of a fork starting at a child of that node. When collecting all forks of a tree, this overlap results in several copies that are made of each node, because they belong to different overlapping forks. When we want to check whether all forks of a tree are accepted by a fork set acceptor, however, we do not need to collect the set of all (k, l)-forks explicitly. We can perform a single run (bottom-up) over the tree, in which each node is accessed only once, and processed by different parallel runs. Each run corresponds to a specific fork to which that node belongs. When a node is processed as a leaf, or as the leftmost child of a node of a fork, a new parallel run is started. The existing runs process the node as an internal node of a fork. The runs that have finished a fork will expire. This entails that the number of parallel runs is bounded, as the fork set acceptor does not contain loops and the branching factor of the accepted trees is limited by k. Also, the number of transitions of the tree automaton is bounded by l. As soon as one of the forks is rejected, the overall result will be reject.

A single run using parallel processes still requires that a single node gets processed multiple times. We introduce in this section an operator that converts a fork set acceptor into a (k, l)-contextual acceptor for that fork set, which processes each node only once. Furthermore, we discuss optimizations for the representation of the data structures in this operator.

5.4.2.1 The Conversion Algorithm

We use the framework from Section 3.3.4 to define an operator that takes as input some k and l parameters and a fork set acceptor (for the given k and l), and that returns a single tree acceptor accepting the (k, l)-contextual language that has the set associated with the fork set acceptor as representative set. We therefore specify a composite representation that simulates the parallel runs described above, and we define the 4 functions needed in the framework to define the operator. We first discuss the rationale behind the composite representation of our target acceptor, before we detail it.

Remember that the (k, l)-forks of a tree are the (k, l)-roots of each of its subtrees, and that the (k, l)-roots of a tree are formed from the root element of the tree together with the (k, l − 1)-roots of k consecutive children of that tree.

Example 5.11 Having a look at Figure 5.1, we see that the (2, 3)-roots of tree t are made up of the (2, 2)-roots of its children. Writing children in parentheses behind their parent's label, these (2, 2)-roots are b(e f), b(f g), c(h i), c(i j), and d. Similarly, these (2, 2)-roots are built from the (2, 1)-roots of the children of nodes b and c, being e, f, g, h, i, and j.

A single node of a tree will be a part of several forks of the tree. It will be the top of some (k, l)-forks (namely the (k, l)-roots of that node). It will be the top of some (k, l − 1)-roots that are part of (k, l)-forks starting at the parent of that node. It will be the top of some (k, l − 2)-roots that are part of (k, l)-forks starting at the grandparent of that node. And this goes on: it will be the top of some (k, 1)-roots that are the leaves of forks originating at an ancestor of the (l − 1)th degree.

Example 5.12 In Figure 5.1, the node f has a single (2, 3)-root: f(k), which is one of the forks of tree t. This is also the single (2, 2)-root of node f, and it is part of the two forks starting at its parent node b. The (2, 1)-root of node f, being f, is the leaf of several of the forks starting at node a.

When we process a node from a tree with parallel runs over the fork set acceptor, we will not only calculate the tree states resulting from processing its (k, l)-roots with the fork set acceptor, but also the tree states resulting from its (k, j)-roots, with j < l, as these tree states might be used to calculate the tree states for the (k, l)-roots of some ancestors of that node.

As composite representation for the tree states, we therefore use a one-dimensional array of sets of tree states from the original fork set acceptor. The position in the array indicates the depth j of the (k, j)-root of the node being processed. Therefore the set of tree states at position j contains the tree states of all (k, j)-roots that have the current node as root node.

Remember also that to calculate the tree state of a tree, given some tree acceptor, we run the string automaton representing the transition function of that tree acceptor. We start from the initial state associated with the label of its root. From there we take the transition given the tree state of its first child. From the resulting string state, we continue with the tree state of its second child, and so on.

When we process a node from a tree with parallel runs over the transition function (string automaton) of the fork set acceptor, we will process its children one by one. These runs are divided by level. To calculate the tree states of the (k, j)-roots of a node, the (k, j − 1)-roots are given as input to the string automaton, and this for every level j. But on the same level we can divide the runs even further. When processing a node as a child of a (k, j)-root, this node can be the first child of a root, or the second child of another root, . . . , or the kth child of still another.

Furthermore, when we process the ith child of some (k, j)-root of a given node, that child can have multiple (k, j − 1)-roots, resulting in multiple tree states for that child. We have already modelled this in the composite tree representation, as it keeps a set of tree states for each level. Processing each of these tree states with the transition function of the fork set acceptor can result in different states. Therefore we will represent the parallel runs on the string automaton, for a level j where a child is processed as the ith child, as a set of string states.

Example 5.13 When processing node c in tree t from Figure 5.1 as the first child of the (2, 3)-root of node a, we see that node c has two (2, 2)-roots: c(h i) and c(i j). The tree states for these roots are then used to make transitions starting from the initial state associated with c, leading to a set of next string states.

When processing the (2, 3)-root of node a, before we process its child node c, we find a parallel process where the child b is already processed as first child of the (2, 3)-root (or better, (2, 3)-roots, as node b also has multiple (2, 2)-roots). Starting from the set of string states of this process implies that we will process node c as second child of these roots.

The composite representation of a string state is therefore a two-dimensional array, where the elements of the i-th column represent the string states of the original fork set acceptor reached by processing the current child as the i-th child of a fork, and the elements of the j-th row represent the states reached by processing the current node as being a (k, j)-root. The elements of the k-th column contain the states for the finished (k, j)-roots: not only those reached by the last transition, but also those in previous transitions. We let the columns start at position 0. This column contains the initial state associated with the label of the current node. This implies that the resulting automaton will be split into disjoint graphs, one graph per initial state. This might change after minimization.

Note that for the first k − 1 children of the current node that are processed, not all positions in the root are possible. The first child of a node cannot be the second child of a fork starting in the same node. Therefore the k − 1 states reachable from an initial state will have fewer than k columns (we denote the number of columns with maxCol). It is possible that, even before k children are processed, a valid root is found (when the fork has fewer than k children).

We will now discuss and define each of the functions needed in the framework to define the operator that converts a fork set acceptor into a (k, l)-contextual tree acceptor.


getInitialComposite The initial composite state will have a single column filled with the initial state from the fork set acceptor that is associated with the input symbol (see Algorithm 5.3).

Algorithm 5.3 Function getInitialComposite for the conversion to a (k, l)-contextual acceptor
Input: An input symbol a and a list (F) containing the fork set acceptor.
Output: The initial composite state: init.
1: if αF(a) == nil then
2:   init = nil
3: else
4:   init = new 2-dimensional array
5:   for level = 1 to l do
6:     init[0, level] = αF(a)
7:   end for
8: end if

getCompositeOutputS The output of a composite string state cS is a composite tree state, containing the original tree states for all the (cS.maxCol, j)-roots that are processed up to that string state (see Algorithm 5.4).

Algorithm 5.4 Function getCompositeOutputS for the conversion to a (k, l)-contextual acceptor
Input: A composite string state cS, input symbol a, and a list (F) containing the fork set acceptor.
Output: A composite tree state: result.
1: result = new array
2: for level = 1 to l do
3:   result[level] = ∅
4:   for all s ∈ cS[cS.maxCol, level] do
5:     result[level] = addToSet(result[level], φFS(s))
6:   end for
7: end for
8: if getCompositeOutputT(result) == reject then
9:   result = nil
10: end if

Note that a tree is an element of a (k, l)-contextual language when each of its (k, l)-forks is accepted by the associated fork set acceptor. Each subtree of that tree is also an element of that (k, l)-contextual language. This is easily proven, as the set of (k, l)-forks of that subtree is a subset of the set of (k, l)-forks of the tree, and those are all accepted. This implies that, in a (k, l)-contextual acceptor, all intermediate tree states of an accepting run are accepting. Consequently, a rejecting tree state is a dead tree state. As all dead tree states are equivalent, we will represent all rejecting composite tree states with a single dead tree state. In our implementation we represent the dead tree state with the symbol nil. We represent the dead string state with the same symbol. Algorithm 5.4 therefore returns nil instead of a rejecting composite tree state.

Note also that a single (k, j)-root of a given tree resulting in a dead tree state is sufficient to prevent the tree from being accepted, if the (k, j)-roots are part of the (k, l)-forks of the tree. Therefore there is no need to check the other roots of the same depth j. When we collect the set of roots at the same level, we discard them all when one of them is a dead state (nil). We implement this in an auxiliary function addToSet (Algorithm 5.5), as we will use the same functionality for string states. We also represent a set of states containing a dead state as nil.

Algorithm 5.5 Auxiliary Function addToSet
Input: A set of states: set, and a state s.
Output: The resulting set: result.
1: if set == nil or s == nil then
2:   result = nil
3: else
4:   result = set ∪ {s}
5: end if
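In Python, nil can be modelled with None, which then absorbs the whole set. A sketch of Algorithm 5.5 (the function name add_to_set is ours):

```python
def add_to_set(states, s):
    """addToSet: a dead state (None) poisons the entire set of states."""
    if states is None or s is None:
        return None
    return states | {s}
```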

getCompositeOutputT The output of a composite tree state reflects whether trees that result in that state belong to the (k, l)-contextual language or not. A tree is accepted when each of its (k, l)-forks is accepted by the fork set acceptor. Its (k, l)-forks are split up into its (k, l)-roots and the (k, l)-forks of each of its children. As we make sure that the tree state results in a dead state (see the explanation of the transition function later on) when one of its children results in a dead state, the output of a (non-dead) tree state will reflect only whether the (k, l)-roots of the tree are accepted. The tree states (reached by the fork set acceptor) for the (k, l)-roots of the tree are placed in the array of the composite tree state at position l. Therefore, the output of the composite tree state is accept only when each of these states is accepting. This is calculated in Algorithm 5.6.

Algorithm 5.6 Function getCompositeOutputT for the conversion to a (k, l)-contextual acceptor
Input: A composite tree state cT, input symbol a, and a list (F) containing the fork set acceptor.
Output: The output of the composite state: result.
1: result = accept
2: for all s ∈ cT[l] do
3:   if φFT(s) == reject then
4:     result = reject
5:   end if
6: end for

getCompositeTransition The transition of a composite string state to another state, for a given composite tree state, uses the transition function of the fork set acceptor to calculate the transition of each of the original states in that composite string state, with as input the original tree states at the correct level in the composite tree state. As an extra child of the fork gets processed, the results for the elements of a given column are placed in the next column. This function is shown in Algorithm 5.7. After k + 1 transitions, the fork set acceptor will always end up in a dead string state, because each node in a (k, l)-fork has at most k children. When the composite representation has k columns, the transitions for the last column are therefore not calculated.

Note that we make the assumption that the input to this conversion operator is a (k, l)-fork set acceptor. When this is not the case, and the acceptor passed to the operator accepts trees with a larger branching factor than k, the algorithm as specified above will ignore such trees. This property is used in Section 6.1.2.

The (i, 1)-root of a tree has no children. It is the root of that tree. Therefore the (i + 1, 1)-root of that tree will remain the same root (see Line 6).

Example 5.14 To illustrate the composite representation for this conversion operator, we show a part of the (3, 3)-contextual tree acceptor learned for tree t from Example 5.10. We do the conversion from the fork set acceptor shown in Figure 5.4. This is the minimal acceptor equivalent to the fork set acceptor from Figure 5.3. The part of the (3, 3)-contextual acceptor (shown in Figure 5.5) is sufficient to process (and accept) the first of the two trees shown in Figure 5.2.

The initial composite state associated with the symbol b contains on every level the initial state for b from the fork set acceptor, being the state reached when a leaf b (without any children) is processed. The (3, j)-roots of a leaf are all the same (that leaf), and the tree state resulting from the fork set acceptor for this leaf is the accepting tree state 2. Therefore the output of the initial state is accept.


Algorithm 5.7 Function getCompositeTransition for the conversion to a (k, l)-contextual acceptor
Input: A composite string state cS, a composite tree state cT, and a list (F) containing the fork set acceptor.
Output: The next composite string state: next
1: max = cS.maxCol
2: if max < k then
3:   max = max + 1
4: end if
5: next = new 2-dimensional array
6: for col = 0 to max do
7:   next[col, 1] = cS[0, 1] // initial states
8: end for
9: for level = 2 to l do
10:   next[0, level] = cS[0, 1] // initial states
11:   for col = 1 to max do
12:     next[col, level] = ∅
13:     for all stateS ∈ cS[col − 1, level] do
14:       for all stateT ∈ cT[level] do
15:         next[col, level] = addToSet(next[col, level], δFS(stateS, stateT))
16:       end for
17:     end for
18:   end for
19: end for
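A hedged Python transcription of Algorithm 5.7 follows. The composite string state is modelled as a dict keyed by (column, level) with a `'maxCol'` entry, and `delta_FS` and `init` stand in for the fork set acceptor's string transition function and its initial string state; all of these names are assumptions for illustration, not the thesis' implementation:

```python
def get_composite_transition(cS, cT, k, l, delta_FS, init):
    """One transition of the composite string state for composite tree
    state cT: widen the window up to k columns, then feed the tree
    states of each level through the fork set acceptor's transition."""
    max_col = cS["maxCol"]
    if max_col < k:
        max_col += 1
    nxt = {"maxCol": max_col}
    for col in range(0, max_col + 1):
        nxt[(col, 1)] = {init}          # level 1 always holds initial states
    for level in range(2, l + 1):
        nxt[(0, level)] = {init}        # column 0 holds initial states
        for col in range(1, max_col + 1):
            dest = set()
            for state_s in cS.get((col - 1, level), set()):
                for state_t in cT[level]:
                    dest.add(delta_FS(state_s, state_t))
            nxt[(col, level)] = dest
    return nxt
```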

Figure 5.4: The minimal fork set acceptor for the (3, 3)-forks of tree t from Example 5.10.


The output of the initial state for symbol a is calculated in the same way. Given that state 1 is a rejecting state though, this output is equivalent to the dead tree state (and is therefore not shown in Figure 5.5).

A few transitions later we reach a string state whose first row contains the initial states, just like in every composite state. The second row contains the string states of the fork set acceptor, reached after processing respectively the (i, 2)-roots of three subtrees. Because the depth of these three trees is smaller than 3, their (i, 3)-roots will be the same as their (i, 2)-roots. Therefore the third row equals the second one.

Because the composite states only keep track of the last k children that are processed, a loop occurs in the acceptor: it oscillates between a state in which the trees end up that have a(b, c, b) as last fork, and a state in which the trees end up that have a(c, b, c) as last fork (writing a(b, c, b) for a root a with children b, c, and b).

Proposition 5.7 Given a tree acceptor T, constructed by using Algorithms 5.3, 5.4, 5.6, and 5.7, starting from a (k, l)-contextual fork set acceptor F, then T is a (k, l)-contextual tree acceptor with as representative set the set of forks accepted by F.

Proof It is clear that using the general construction algorithm presented in Section 3.3.4, with Algorithms 5.3, 5.4, 5.6, and 5.7 as the respective functions getInitialComposite, getCompositeOutputS, getCompositeOutputT, and getCompositeTransition, will result in an automaton equivalent to the automaton based on F, using the composite representation described in the beginning of this section.

It remains to be proven that this composite representation, on the one hand, accepts every tree whose set of (k, l)-forks is accepted by F, and, on the other hand, that for every tree that is accepted, the set of its (k, l)-forks is accepted by F.

The l-th element of the array that represents a composite tree state for a given tree contains the set of tree states resulting from processing the (k, l)-roots of that tree with F. When a composite tree state is accepting, each of these tree states is accepting, meaning that all the (k, l)-roots of that tree are accepted by F. As mentioned in previous paragraphs, the composite representation is constructed such that every composite tree state is either accepting or a dead state. This


Figure 5.5: A small part of the (k, l)-contextual tree acceptor for Example 5.10

implies that if a tree state resulting from processing a tree is accepting, then all the tree states resulting for its subtrees are also accepting (Definition 3.31). Therefore when the tree state for a tree is accepting, the (k, l)-roots of each of its subtrees are accepted by F, or (given the definition of (k, l)-forks) the set of (k, l)-forks of that tree is accepted by F. This proves the second part.

The first part we prove by contradiction. We assume that a tree whose set of (k, l)-forks is accepted by F is not accepted by T. Therefore the composite tree state for this tree is a dead state. In the composite representation, a tree state for a tree is a dead state only if one of its subtrees results in a dead state or one of the (k, l)-roots of that tree is rejected by F. We know that the (k, l)-roots of that tree are accepted (they are a part of the (k, l)-forks of that tree), hence one of the subtrees of that tree results in a dead tree state. But for these tree states the same reasoning holds: their (k, l)-roots are accepted, hence one of their children has to result in a dead state. This process ends in the leaves. One of the leaves has to result in a dead tree state. As a leaf does not have children, that leaf can only result in a dead state when its only (k, l)-root, the leaf itself, is rejected. This results in a contradiction, because the leaves of the tree also belong to its (k, l)-forks, and those are accepted. □


5.4.2.2 More Optimal Representation

We typically perform a minimization operation on the result of the conversion from a fork set acceptor to a (k, l)-contextual acceptor. If we can merge some equivalent states early on in the conversion step, both conversion and minimization will be more efficient, as the intermediate result will be smaller. Below we discuss configurations in the composite representation of tree and string states that can be considered equivalent, and we propose an adaptation to the composite representation such that these configurations map onto the same representation, leading to a single state.

The last element of the array in the composite tree representation contains the set of states of the fork set acceptor, reached for the (k, l)-roots of a tree. Since the (k, l)-roots are not used to construct (k, l+1)-roots (l being the maximal depth of the forks), the only use of this set of states is to indicate whether the composite tree state is accepting or rejecting. Therefore every two composite tree states having the first l − 1 elements of their array in common, and having the same output (accept or reject), are equivalent tree states. As all tree states (different from the dead tree state) are accepting, we represent the tree state by only the first l − 1 elements.

Example 5.15 Observing the (k, l)-contextual acceptor learned for tree t from Example 5.10, we see that the original composite representation contains 7 tree states, while in the new representation these map onto 5 tree states.

The same holds for composite string states. The k-th column is not used in computing the transition, only in computing the output of the string state. But in contrast to tree states, which all have the same output, accept (except for the dead state), it is possible for two string states to have different outputs. We will therefore change the representation of the string state from a matrix of k columns to a pair of a matrix with only k − 1 columns and the tree state that is its output. This is not sufficient though. We have to add an indication of whether k children have already been processed in the state or not. Otherwise an original composite representation with k − 1 columns and an original composite representation with k columns, having the same initial elements and the same output, would be considered as a single state.

Example 5.16 Observing again the (k, l)-contextual acceptor learned for tree t from Example 5.10, we see that two string states in the original composite representation only differ in their last columns and that they have equivalent outputs. Hence these two states are equivalent themselves, mapping in the new representation onto the same state: a triple of the reduced matrix, the output, and the boolean true.

In Figure 5.6, we show the complete (3, 3)-contextual tree acceptor, accepting the (3, 3)-contextual tree language defined by the (3, 3)-fork set of tree t from Example 5.10. Note that in the graphical representation we do not show the output of the composite string state as a part of the composite string state, as it is shown anyway in the node, as the output for that node. We also left out the first column and the first row, as they always contain the same initial state. For each of the disjoint parts of the acceptor graph, the initial state can be found as the initial state for the same symbol in the fork set acceptor of Figure 5.4. To indicate that a string state is reached after at least k transitions, the state has triangles added on both sides.

Example 5.17 With the new representation, the resulting (k, l)-contextual tree acceptor, as shown in Figure 5.6, is the minimal equivalent acceptor. The new representation, though, is no guarantee for a minimal result. We give another example in Figure 5.7, consisting of another set of forks (k = 2, l = 3), its fork set acceptor, and the linked (k, l)-contextual acceptor. The (k, l)-contextual acceptor in this example is not minimal, because two of its tree states are equivalent. One of these states is also an example of a state with a set containing more than one state. Note that we indicate that a position contains a dead state by leaving that position empty.

When a composite string state has a dead state in the k-th column for a given level, we know that all composite string states reachable from that string state keep that dead state in their k-th column for that level. This implies that their outputs are tree states having a dead state for that level. All elements in the row before the k-th column will therefore never make any difference. Two string states containing similar elements except for a single row, and both having a dead state as k-th element of that row, are therefore equivalent. We change the composite representation such that when a row has a dead state as k-th element, each element of that row is replaced by a dead state. This way, states that can be proven to be equivalent according to the above logic will map onto the same representation.

Figure 5.7: Examples of Finite Tree Acceptors

Figure 5.8: Examples of Finite Tree Acceptors

Example 5.18 Given yet another set of forks (Figure 5.8), with associated fork set acceptor and (k, l)-contextual acceptor (k = 2, l = 3), we illustrate this latest optimization. Using the original representation we end up with two states having the same output. The first optimization results in two separate states, both with the same output. The second optimization, on the other hand, does map these two states onto a single state. Combined, the two optimizations result in a single state.

A final observation is that when a tree has a subtree with a rejected (k, i)-root,


that whole tree will be rejected, independent of the acceptance of the (k, i−1)-roots of that subtree. Two composite tree states will therefore be equivalent when they have a dead state at a certain level and equal sets of states above that level. The sets below that level can be ignored. As a third optimization, we will replace all sets of states below a dead tree state in a composite tree state with a dead tree state. Thus we ensure that the states that are equivalent according to the above explanation map onto the same state.
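A sketch of this third optimization, under the assumption (suggested by the paragraph above) that "below" a dead state means the shallower levels, i.e. the array positions with a smaller index; `DEAD` and the list-of-levels representation are illustrative choices, not the thesis' actual encoding:

```python
DEAD = None  # illustrative stand-in for the dead tree state

def canonicalize_tree_state(levels):
    """Replace every level below (at a smaller index than) a dead level
    by the dead state, so that composite tree states that are provably
    equivalent share a single representation."""
    out = list(levels)
    dead_seen = False
    for i in range(len(out) - 1, -1, -1):   # walk from level l down to 1
        if dead_seen:
            out[i] = DEAD
        dead_seen = dead_seen or out[i] is DEAD
    return tuple(out)
```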

Example 5.19 In Figure 5.8, this extra optimization maps one of the tree states onto a different representation. Note that for this specific example the optimization did not lead to a merging of tree states.

5.5 Summary

In this chapter we have shown how to learn tree languages and tree acceptors from positive examples only. We started this chapter with an overview of existing string approaches towards learning from positive examples only, which circumvent the finding that regular languages cannot be learned from positive examples only. We concentrated on a subclass of the regular string languages: the class of k-contextual string languages, which is learnable from positive examples only. This class restricts the possible building blocks (k-grams) of a given size that can occur in strings of the language. We discussed its properties and their intuition.

As a contribution, we defined (k, l)-contextual tree languages as a subclass of the regular tree languages. Similar to the class of k-contextual string languages, this class restricts the possible building blocks ((k, l)-forks) of a given size that can occur in trees of the language. We stated and proved different properties of this class. One of these properties states that this class is learnable from positive examples only. We presented a practical induction scheme for these languages.

Moreover, we indicated how to learn a (k, l)-contextual tree language directly as an acceptor. Our approach learns in a first step a fork set acceptor that accepts the representative set of the language; in a second step, this fork set acceptor is converted to a (k, l)-contextual tree acceptor.


Chapter 6

Wrapper Induction with (k, l)-Contextual Tree Languages

In Chapter 4 we have discussed how tree automata can be used to represent wrappers for information extraction from web pages. At that time we did not elaborate on obtaining the automata themselves. Now that we have seen the concept and induction of (k, l)-contextual languages in Chapter 5, we will in this chapter proceed to use (k, l)-contextual languages to represent wrappers and to learn them from examples.

In Section 6.1 we investigate the representation of wrappers as (k, l)-contextual tree languages, and we address several adaptations to improve the induction for this particular domain. For practical use we need to know appropriate values for the parameters k and l for a given task. A first solution, which we discuss in Section 6.2, is to estimate these parameters based on the behavior of the different wrappers on unmarked data. In a second approach (Section 6.3), we learn the values of the parameters based on a small set of negative examples. In Section 6.4, we improve on this last scheme by devising an interactive algorithm that restricts the possible negative examples to the false positives of previous hypotheses. We also report on an implementation incorporating this algorithm, which allows easy user interaction thanks to a graphical user interface.



6.1 Information Extraction with (k, l)-Contextual Tree Languages

Using marked trees as examples, we can learn marked (k, l)-contextual tree languages. To learn a wrapper for a specific extraction task, we use single correct markings for that task as examples. This ensures that the learned language will accept a superset of SCM. For now we assume that the targeted extraction task is indeed expressible by a marked (k, l)-contextual tree language. We will elaborate more on this in Section 6.3. It is not guaranteed that the learned language will accept ESCM. For example, when the extraction task has only one target per page, it is possible that the unmarked forks necessary to accept empty markings are never learned, as each of the example pages has its single target marked. One solution is to present for each example also the unmarked page as an example to the induction algorithm. Another solution, applied after learning, is to add to the representative set all the forks of the representative set with their markers stripped off. Either solution results in a language that accepts a superset of ESCM, and logically a subset of PCM. In most cases PCM will be accepted, but not when building blocks are large enough to contain two targets simultaneously. This last problem cannot be detected or fixed easily.
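The second solution, closing the representative set under marker removal, amounts to a one-liner; `strip_marker`, which removes the marker from a fork, is an assumed helper (simulated here on string-encoded forks):

```python
def add_unmarked_variants(representative_set, strip_marker):
    """Add, for every fork in the representative set, the same fork with
    its marker stripped off, so that empty markings are accepted too."""
    return representative_set | {strip_marker(f) for f in representative_set}
```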

Having a language guaranteed to accept ESCM allows one to perform extraction based directly on the learned representative set. One can mark each time a single node in the tree. If this marked tree is accepted, the marked node is one of the target nodes. Hence only n markings have to be checked, with n the number of nodes. A similar approach is also used in (Kosala et al. 2006; Kosala et al. 2003; Kosala et al. 2002). More efficiently, we can create the (k, l)-contextual tree acceptor (see Section 5.4) associated with the learned language, convert it subsequently to a PCM and a CCM acceptor (see Sections 4.2.5 and 4.2.4), and use the latter to do single-run extraction (see Section 4.4).

The remainder of this section introduces two generalizations of the basic (k, l)-contextual tree construction. The first one introduces wildcards to generalize over irrelevant text nodes, and the second one ignores forks that are not in the neighborhood of the node of interest. Furthermore, we detail a specialized version of the operator that converts to the (k, l)-contextual tree acceptor, which takes these generalizations into account.

6.1.1 Practical Wrapper Induction

As the text nodes come from an infinite alphabet, we cannot learn them from a small number of examples. To solve this, we follow (Kosala et al. 2006; Kosala et al. 2003; Kosala et al. 2002): all text nodes in the examples are replaced by a wildcard (@). During extraction the wildcard matches every text node, even those not seen during the learning phase. Sometimes this leads to overgeneralization, when


Figure 6.1: Positive example for author extraction in the Article Database (left) and for student extraction in the Student List (right). In both examples, one node is marked (with respectively "A" and "N"). For the Article Database, all text nodes have been replaced by the wildcard "@"; for the Student List, {'name:'} is used as distinguishing context and is not replaced by the wildcard symbol.

a text node close to the target is needed to disambiguate between a positive and a negative example. We call such a text node a distinguishing context. A set of distinguishing contexts can be given, to keep text nodes with these contexts from being replaced.
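The wildcard preprocessing might look as follows on a tiny `(label, children)` tuple encoding of trees; the `is_text` convention that text labels are single-quoted is an assumption of this sketch only:

```python
WILDCARD = "@"

def is_text(label):
    # illustrative convention of this sketch: text nodes are quoted
    return label.startswith("'") and label.endswith("'")

def replace_text_nodes(tree, distinguishing=frozenset()):
    """Replace every text-node label by the wildcard '@', except the
    distinguishing contexts, which are kept literally."""
    label, children = tree
    if not children and is_text(label) and label not in distinguishing:
        label = WILDCARD
    return (label, tuple(replace_text_nodes(c, distinguishing)
                         for c in children))
```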

Example 6.1 Figure 6.1 shows a positive example for the author extraction task of the Paper Database example from Section 2.2.1.2 (using marker 'A'), as well as a positive example for the name extraction task of the Student List example from Section 2.2.1.1 (using marker 'N'). One node is marked in both examples; a distinguishing context is used for the name extraction task.

The heuristic we use to determine the set of distinguishing contexts is different from (and simpler than) the one used in (Kosala et al. 2006; Kosala et al. 2003; Kosala et al. 2002). We inspect all positive examples in a preprocessing step. For each positive example we collect into a set those text nodes occurring in the (k, l)-forks for that example that contain the marked node. The set of distinguishing contexts is then the intersection of these sets. This way text nodes are only generalized when there is a positive example for which they do not occur in its parameterized neighborhood. This procedure guarantees that (given sufficient examples) all the strings remaining in the resulting set are 'true' context for the target node. It is possible, though, that some discriminative context string is not found (for example, when the target is a node with as context either c1 or c2). So far we have not encountered the need for a more elaborate procedure. A final remark is that the use of distinguishing contexts can be turned off. The boolean which controls this feature can be considered as a third parameter of our algorithm, the others being k and l.
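The intersection heuristic can be sketched directly; `text_nodes(fork)`, yielding the text labels occurring in a fork, is an assumed helper:

```python
def distinguishing_contexts(marked_fork_sets, text_nodes):
    """For each positive example, collect the text nodes occurring in its
    marked (k, l)-forks; return the intersection over all examples."""
    per_example = [
        {t for fork in forks for t in text_nodes(fork)}
        for forks in marked_fork_sets
    ]
    return set.intersection(*per_example) if per_example else set()
```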


When learning marked (k, l)-contextual tree languages, the set of forks in the representative set splits naturally in two: a set of forks with a single node marked, and a set of forks without any marked node. One can argue that the forks containing the marker provide the local context needed to decide whether a node should be extracted or not, while the other forks describe the general structure of the document. The latter merely serve to decide whether the document is in the domain of the extraction task. Learning the domain typically requires substantially more examples than learning the local context. However, in our setting we assume all documents are from the correct domain; hence there is no need to learn the domain and we can ignore all forks not containing the marker during learning and extraction. Hence for learning we collect only the marked forks into the representative set. For extraction we still iterate over each node, mark it, and check whether the resulting tree, marked with a single marker, belongs to the learned language. We do this by collecting only the marked forks and checking whether they form a subset of the representative set. This way a small set of examples will be able to cover the variance of forks in areas far away from the targets. This also makes the wrapper more robust in case of changes in the generating script. An example could be that the pages in our domain contain a header with some advertisements. No targets are in the direct neighborhood of this header, and therefore only unmarked forks cover this header. When the advertisements change on a regular basis, the wrapper needs no update, as the unmarked forks are ignored. Of course, when structures in the header become similar to the structure around the target elements, and the wrapper starts (wrongly) extracting elements from the header, the wrapper needs to become more specific.
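The extraction loop just described reduces to a subset test per node. A minimal sketch, where `marked_forks(node)` (an assumed callback) returns the (k, l)-forks containing the marker when only `node` is marked:

```python
def extract(nodes, representative_set, marked_forks):
    """Return the nodes whose single marking yields only forks that
    already occur in the learned representative set."""
    return [n for n in nodes if marked_forks(n) <= representative_set]
```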

Example 6.2 Given the positive example for the author extraction task, shown in Figure 6.1, and given the parameters k = 1 and l = 3, we learn the wrapper by collecting the marked forks. The representative set for the wrapper becomes {@:A, a(@:A), p(a(@:A))} (writing x(y) for a node x with child y). Marking the node containing 'author2' and collecting its (1,3)-forks results in the set {'author2':A, a('author2':A), p(a('author2':A))}. Given that a wildcard matches every text node, each of these forks matches with one of the forks in the representative set; hence 'author2' is extracted. For the text node containing '1', the collected set of (1,3)-forks is {'1':A, a('1':A), center(a('1':A))}. The last fork does not match with the forks in the representative set and the marked text node is rejected.

For the student extraction task, the tree as shown in Figure 6.1 gives as representative set of the wrapper learned for k = 2 and l = 3 the following set of forks: {@:N, b(@:N), li('name:', b(@:N))}. This wrapper will extract all students, and reject all other nodes.

6.1.2 Learning Wrappers as Automata

For information extraction we use only the marked forks to represent the wrappers. But when representing (k, l)-contextual tree languages as automata (Section 5.4), no distinction is made between marked and unmarked forks. We can simulate the behavior of the above wrappers when we use the same set of marked forks and, as unmarked forks, the set of all possible unmarked forks. This way no restriction is put on the occurrence of unmarked forks, which amounts to ignoring them. Below, we show how we can efficiently construct a fork set acceptor that accepts only marked forks, and we discuss how we can use this marked fork set acceptor to create a (k, l)-contextual tree acceptor that accepts the same trees as wrappers based on marked forks.

We assume that each example is still a tree with a single target marked. Since we only want to accept the marked forks, there is no need to scan the whole tree as in Section 5.4.1. It suffices to process only the nodes that can occur in a marked fork. With only one node in the tree marked, only the (k, l)-roots from the marked node itself and the (k, l)-roots from its l − 1 direct ancestors can contain the marker. But not all of them contain the marker. From the children of every ancestor (of the marked node), only the k − 1 siblings before and after the child that is also an ancestor (or the marked node itself) need to be considered. Starting from the marked node, and using a data structure where one has access to the parent, one can construct a variant of Algorithm 5.2 that disregards the irrelevant parts of the input tree. This algorithm returns a marked fork set acceptor, without having to traverse the example trees top-down.
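The bottom-up collection of candidate nodes can be sketched as follows; the toy `Node` with parent links is defined only for this sketch, and the real implementation in the thesis may differ:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass(eq=False)
class Node:
    label: str
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)

    def add(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)
        return child

def relevant_nodes(marked: Node, k: int, l: int) -> set:
    """Nodes that can occur in a (k, l)-fork containing the marker:
    the marked node, its l-1 nearest ancestors, and for each ancestor
    the k-1 siblings on either side of the child on the path."""
    relevant = {marked}
    child, anc = marked, marked.parent
    for _ in range(l - 1):
        if anc is None:
            break
        relevant.add(anc)
        i = anc.children.index(child)
        relevant.update(anc.children[max(0, i - (k - 1)): i + k])
        child, anc = anc, anc.parent
    return relevant
```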

To use the operator that converts a fork set acceptor to a (k, l)-contextual tree acceptor (Section 5.4.2), we could start from the union of the marked fork set acceptor with an acceptor accepting every possible unmarked (k, l)-fork over the given alphabet. In practice we use, instead of the latter, an acceptor that accepts every possible unmarked tree over the alphabet. This acceptor accepts a superset of the set of possible unmarked (k, l)-forks. Given that the unmarked trees accepted by this acceptor that are bigger than (k, l)-forks are ignored in the conversion operator, the result will be the same. This way we only need one acceptor, instead of a different acceptor accepting every possible unmarked


(k, l)-fork, for different values of k and l.

6.2 Parameter Estimation

Given a training set and a test set, we define the optimal parameters (k, l) as those parameters resulting in the best F1 score when a wrapper, learned from the given training set with these parameters, is evaluated on the given test set. As mentioned before, we typically do not have a set of completely annotated examples that can be used as test set at our disposal during the learning phase. Therefore we will estimate these optimal parameters based on the small training set. We share some observations below, leading to a heuristic for estimating the parameters based on extraction from unmarked pages. Unmarked pages can be obtained as extra unmarked pages, or as the pages from the examples in the training set with their markers stripped off (an example has only one of its targets marked). Despite the lack of a strong foundation for this heuristic, experimental results (Section 7.3.1) indicate that it is certainly useful.

Proposition 5.4 states that for a given set of examples, the learned language is anti-monotonic in the parameters k and l. Smaller k and l result in more general languages. In the application to extraction, we have seen that only the marked forks matter in the decision to select a node or not. The set of marked forks forms a kind of rough window around the marked node, whose size is determined by the parameters k and l. The language learned with the smallest window (k = 1, l = 1) is the most general; it extracts everything, hence recall is 100% and precision is low. As the (k, l)-window is enlarged, more and more of the features surrounding the marked node are taken into account, fewer fields are extracted and precision increases; however, at least in the beginning, recall is likely to stay at 100%. If the extraction task can be expressed with a contextual language then, if enough examples are given, precision will reach 100% while recall is still at 100%. Typically, there will be a region of parameter settings for which the F1-score remains 100%. In that region, the number of extracted fields remains constant. Part of the explanation for this phenomenon is that HTML does not only convey the structure of the document but also the extra complexity of the look, leading to structural redundancy. When the (k, l)-window is further enlarged, a point will be reached where the number of extracted fields starts to drop; while precision remains at 100%, recall (and hence the F1-score) starts to drop at a faster pace.

Hence, for the parameter setting (1, 1), all fields are extracted. As k and l increase, the number of extracted fields quickly goes down; however, one can expect the number of extracted fields to stabilize in the region with optimal F1-score and then to go down further again. This behavior is illustrated in Example 6.3.

Example 6.3 Table 6.1 and Table 6.2 show the number of extractions for two data sets where the wrapper is learned from five random examples (see Section 2.3


l\k     1     2     3     4     5     6     7

 1   1267
 2    141   141   141   141   141   141   141
 3    126   115   115   115   115   115   115
 4    120   115   115   115   115   115   115
 5    120   115   115   115   115   115   115
 6    120   115   115   111   111   111   111
 7    115   115    98    97    90    90    75

Table 6.1: Number of extractions for okra-1

l\k     1     2     3     4     5     6     7

 1    969
 2    412   402   402   402   402   402   402
 3    309   299   299   299   299   299   299
 4    309    98    98    98    98    98    98
 5    309    98    82    51    38    31    27

Table 6.2: Number of extractions for bigbook-3

for details about the data sets used). The extractions are performed on the (unmarked) pages that contained the examples. In Table 6.1 we see a large region with 115 extractions (corresponding to the optimal F1-score). In Table 6.2 there is a region with 98 extracted fields (corresponding to the optimal F1-score) that stretches over 2 l-values, but it does not cover a 2 × 2 region.

Starting from a parameter setting (1, 2), our heuristic searches for a region where the number of fields, extracted from some evaluation set, changes minimally. However, the search is interrupted when it becomes clear that the model becomes too specific. Two tests are used to check this:

• A clear indication that the model is too specific is when the number of extractions equals the number of examples.

• Another test checks whether a value of l is reached for which the number of extractions keeps decreasing for increasing k-values. This behavior is seen in lines 7 and 5 of respectively Table 6.1 and Table 6.2. The explanation is that the depth of the marked forks (l) is becoming so large that they contain a common ancestor of the target fields. For such an l-value, the subtrees containing a target have a different position (first, second, . . . , last) as a child of the common ancestor. With increasing k, the automaton can distinguish more and more of these positions, and the number of extractions decreases unless there is an example for every position.


Precision will never reach 100% when (k, l)-contextual tree languages cannot model the extraction task. When the class is expressive enough, but not enough examples are available for learning, the F1-score will not reach 100%. However, there is still likely a region where the F1-score is optimal and the number of extracted fields stabilizes.

6.2.1 Implementation

Let E be the training set, W(k, l) the wrapper learned for the parameters k and l, and T the evaluation set. We define maxl as the maximal depth found in the trees of the training set, and we define maxk as the maximum number of children of a node found in the training set. The number of extractions made on T with W(k, l) is denoted as ne(k, l). To evaluate the quality of a wrapper we consider a (2 × 2)-region and calculate the sum of the differences between ne(k, l) and three of its specializations, i.e., we define diff(k, l) = (ne(k, l) − ne(k, l+1)) + (ne(k, l) − ne(k+1, l)) + (ne(k, l) − ne(k+1, l+1)).
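As an illustration (our own sketch, not the thesis implementation), the tabled ne and the diff measure can be written as follows; count_extractions(k, l) is an assumed callable that learns W(k, l) from the training set and counts its extractions on the evaluation set:

```python
from functools import lru_cache

def make_diff(count_extractions):
    """Build a memoized ne(k, l) and the (2 x 2)-region diff measure.

    count_extractions(k, l) is assumed to learn the wrapper W(k, l)
    from the training set and count its extractions on the evaluation set.
    """
    @lru_cache(maxsize=None)
    def ne(k, l):
        return count_extractions(k, l)

    def diff(k, l):
        # Sum of differences between ne(k, l) and its three specializations.
        return ((ne(k, l) - ne(k, l + 1))
                + (ne(k, l) - ne(k + 1, l))
                + (ne(k, l) - ne(k + 1, l + 1)))

    return ne, diff
```

On the counts of Table 6.2, for instance, diff(1, 3) = 221 while diff(2, 4) = 16: the difference collapses once the (k, l)-window enters the stable region.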

Algorithm 6.1 shows a high-level description of our implementation. The initialization computes the bounds maxk and maxl on respectively k and l, and an upper bound maxdiff on diff(k, l). The algorithm searches for the parameter setting with the minimal difference and returns the values closest to the origin (1, 1) in case the corresponding parameter setting is not unique. The Manhattan distance is used to measure the distance to the origin, and the outer loop of the algorithm iterates over the Manhattan distances.

The inner loop iterates over the possible l-values for a given Manhattan distance (starting from l = 2 since l = 1 returns all candidate targets). After calculating the corresponding k-value, diff(k, l) is computed. Doing so requires knowledge of the number of extractions by W(k, l), W(k+1, l), W(k, l+1), and W(k+1, l+1). The values ne(k, l), ne(k+1, l), . . . are tabled (not shown in the code), so that the number of extractions by each wrapper is counted only once. If the difference is better than the best value so far, then the best values are updated. The second if-clause checks whether the optimal solution has already been found. This is the case when bestdiff = 0 or one of the stopping criteria described in Section 6.2 is met. The latter is computed by the function tooSpecific(k, l) that also uses the tabled ne(. . .) values. More precisely, the function succeeds when either ne(k+1, l+1) = #(E) or ne(1, l+1), ne(2, l+1), . . . , ne(k+1, l+1) contains more than 3 different values. The stop criterion is always met before the loop boundaries are reached.
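The whole search described above can be written out as a small executable sketch (our own rendering, not the thesis code); count_extractions, the number of examples, and the bounds maxk and maxl are assumed to be supplied by the caller:

```python
from functools import lru_cache

def estimate_parameters(count_extractions, num_examples, maxk, maxl):
    """Search the (k, l) space by increasing Manhattan distance and return
    the setting whose (2 x 2)-region difference of counts is minimal."""
    ne = lru_cache(maxsize=None)(count_extractions)  # tabled ne(k, l)

    def diff(k, l):
        return 3 * ne(k, l) - ne(k, l + 1) - ne(k + 1, l) - ne(k + 1, l + 1)

    def too_specific(k, l):
        # Stopping criteria of Section 6.2: the count drops to the number of
        # examples, or the next row keeps producing new (decreasing) counts.
        if ne(k + 1, l + 1) == num_examples:
            return True
        return len({ne(i, l + 1) for i in range(1, k + 2)}) > 3

    bestk, bestl, bestdiff = 1, 1, 3 * ne(1, 1)
    for d in range(3, maxk + maxl + 1):   # outer loop: Manhattan distances
        for l in range(2, d):             # l = 1 extracts every candidate
            k = d - l
            if diff(k, l) < bestdiff:
                bestk, bestl, bestdiff = k, l, diff(k, l)
            if bestdiff == 0 or too_specific(k, l):
                return bestk, bestl
    return bestk, bestl
```

Fed with the counts of Table 6.2 (counts saturated at the table boundary), this sketch returns (2, 4), which lies in the region with 98 extractions.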

6.3 Learning the Parameters

We describe in this section how we can learn good values for the parameters k and l, based on an additional small set of negative examples. This deviates from our premise that we learn from positive examples only; therefore the question


Algorithm 6.1 Parameter Estimation
Input: The training set E and the evaluation set T.
Output: The estimated parameters as a pair (k, l).
 1: (maxk, maxl) := calcBoundaries(E)
 2: bestdiff := 3 ∗ ne(1, 1)
 3: for d = 3 to maxk + maxl do
 4:   for l = 2 to d − 1 do
 5:     k := d − l
 6:     if diff(k, l) < bestdiff then
 7:       (bestk, bestl) := (k, l); bestdiff := diff(k, l)
 8:     end if
 9:     if bestdiff = 0 or tooSpecific(k, l) then
10:       return (bestk, bestl)
11:     end if
12:   end for
13: end for

might arise whether it would not be better to use an induction algorithm that is designed to learn from both positive and negative examples. We reply with the following motivation. As algorithms that learn from positive and negative examples learn over the whole class of regular tree languages, they typically need lots of examples. Our approach ensures that it can learn from positive examples only by using a special subclass of the regular tree languages. This has as a side effect that the restrictions imposed on the language dramatically reduce the number of examples needed to learn that language. Using negative examples to learn the parameters for an approach that learns from positive examples only keeps the advantage of needing only a small number of examples. The disadvantage that this subclass is less expressive is negated by the fact that our proposed method indicates when (k, l)-contextual languages are not expressive enough to reach a 100% F1-score for the extraction task at hand. This allows switching to a more expressive class of languages when needed. Note that we have not yet encountered any real-world example for which this was necessary.

One can learn a language from positive and negative examples by searching for the most general language that accepts all the positive examples and still rejects all the negative examples. We use a similar approach, in that our algorithm finds the parameters k and l such that the (k, l)-contextual language learned from the positive examples is the most general one still rejecting all the negative examples.

6.3.1 Algorithm

We go, one by one, over the different elements of our method.


Order relations We will use two order relations on languages. The first is the standard set inclusion L1 ⊆ L2. Recall from Proposition 5.4 that this order is anti-monotonic in the parameters. The second order is defined using a finite set S of trees. Let #acc(S, L) be the number of trees from S that belong to the language L (the count). Then we define L1 ≥#S L2 as #acc(S, L1) ≥ #acc(S, L2). Note that for any S we have L1 ⊇ L2 ⇒ L1 ≥#S L2, hence ≥#S is also anti-monotonic in the parameters, i.e., the count decreases with increasing parameter values.

Solutions In what follows, we denote with [k, l] the (k, l)-contextual language learned from the given positive examples Pos. So, [k, l] equals L(k,l)(F(k,l)(Pos)). Any such set [k, l], for some parameters k and l, is called a potential solution. If, moreover, [k, l] is consistent with the negative examples Neg, i.e., if Neg ∩ [k, l] = ∅, we call [k, l] a solution. We define a solution L1 to be better than L2 when it extracts more targets from the documents used to learn the wrapper; more formally, when #acc(S, L1) ≥ #acc(S, L2). Hence the best solution is the solution that is maximal in the order ≥#S. Any set of documents from the domain will do; in our implementation we use the set of documents found in the examples.

Heuristic Due to the anti-monotonicity property, we have that #acc(S, [k, l]) ≤ #acc(S, [k−1, l]) and #acc(S, [k, l]) ≤ #acc(S, [k, l−1]); hence #acc(S, [k−1, l]) and #acc(S, [k, l−1]) are upper bounds on the value of #acc(S, [k, l]). The algorithm uses them to estimate the value of #acc(S, [k, l]) and, at each step, computes the count of the language with the best estimate. The search stops when the best estimate cannot improve upon the current best solution.

Example 6.4 Let Pos, the set of positive examples, be the singleton with as element a marked tree t, which is the tree for the HTML example in Figure 2.4, with the node containing ‘title1’ marked. Some examples of potential solutions, given the set Pos, are (writing trees in term notation, with the children of a node between parentheses):

[1, 2] = L(1,2)({@:T, a(@:T)}),
[1, 3] = L(1,3)({@:T, a(@:T), b(a(@:T))}),
[1, 4] = L(1,4)({@:T, a(@:T), b(a(@:T)), p(b(a(@:T)))}), and
[2, 4] = L(2,4)({@:T, a(@:T), b(a(@:T)), p(b(a(@:T), @), a)}).

When we define S as the set of trees with a single node marked that are derived from t, then #acc(S, L) returns the number of nodes that a wrapper based on L will extract from t. For [1, 2] the extractions from t are {title1, author1, title2, author2, title3, author3, Prev, 1, 3, Next}. Hence #acc(S, [1, 2]) = 10. The extractions for [1, 3] are {title1, title2, title3, Prev, Next}, and #acc(S, [1, 3]) = 5. The extractions for [1, 4] and [2, 4] are {title1, title2, title3}, hence #acc(S, [1, 4]) = #acc(S, [2, 4]) = 3. For the set {author1, 1} as negative examples, we get that [1, 3], [1, 4], and [2, 4] are solutions, with [1, 3] the best solution as it is the most general (k, l)-contextual tree language accepting Pos and rejecting all negative examples. For the set {Prev}, only [1, 4] and [2, 4] are solutions. According to the heuristic, the count of [2, 4] will be equal or smaller (equal in this example), hence there is no need for the algorithm to check [2, 4] after [1, 4] is found.

    l   k   c    sol
    2   5   33   No
    3   4   52   No
    4   3   27   Yes
    5   1   48   No

Figure 6.2: Parameter Space and Data Representation

Initialisation We start the search from the language with the largest count (most general). Because the (k, 1)-contextual languages extract all the single-node forks from the examples, they are overly general and of no interest. Therefore, the search starts from the (1, 2)-contextual language.

Algorithm To reduce the space requirements, our algorithm maintains for a given l-value the count of at most one (k, l)-contextual language. If [k, l] is a solution, then the (k+1, l)-contextual language is of no interest as it has a lower count; if it is not a solution, then its count is discarded as soon as the count of the (k+1, l)-contextual language is computed. These counts are maintained in a front (of the search). For each l-value, the front maintains the k-value (F.k[l]), the count (F.c[l]), and whether it is a solution (F.sol[l]) (see Figure 6.2). In each step, the algorithm selects the minimal value l such that the language [F.k[l], l] is most promising for exploration (the function BestRefinement): [F.k[l], l] is not a solution and the estimate of its refinement has the highest bound on its count. For k > 1, the refinement is the language [F.k[l]+1, l]; however, for k = 1, [1, l+1] is also a refinement.

Example 6.5 Given the data in Figure 6.2, the languages [1, 5], [4, 3], and [5, 2] are candidates for refinement. Although [4, 3] has the highest count, its refinement [5, 3] has a count bounded by 33, while both refinements of [1, 5] have a count bounded by 48; hence the latter is selected for refinement.

A final point to remark is that it is useless to consider a language [k, l] with k larger than MaxK(Pos, Neg, l), the maximum branching factor for the forks of


a given depth l (it depends on l because only the forks containing the target are considered). Indeed, an increase of k will not affect the number of extractions. The algorithm below achieves this by setting the k-value at level l to ∞ and the count to 0 when refining it. When this happens for all l-values, it means that no wrapper based on (k, l)-contextual tree languages is expressive enough to reach a 100% F1-score. Note that there is always a solution when all examples come from a single document: the final set of forks then ultimately becomes the set of marked versions of the whole document.

Algorithm 6.2 Learning the Parameters
Input: Pos and Neg, the sets of positive and negative examples.
Output: The parameters k and l of the wrapper.
 1: calc(Pos, Neg, 1, 2) // initialization
 2: bestL = 2
 3: while not F.sol[bestL] do
 4:   if F.k[bestL] = 1 then
 5:     calc(Pos, Neg, 1, bestL+1)
 6:   end if
 7:   calc(Pos, Neg, F.k[bestL]+1, bestL)
 8:   bestL = BestRefinement(F)
 9: end while
10: return F.k[bestL] and bestL

Function: calc(Pos, Neg, k, l)
 1: if k > maxK(Pos, Neg, l) then
 2:   F.k[l] = ∞
 3:   F.c[l] = 0
 4: else
 5:   F.k[l] = k
 6:   W = learnWrapper(Pos, k, l)
 7:   F.sol[l] = W rejects all Neg
 8:   F.c[l] = cnt(extractions(W, Pos, Neg))
 9: end if

The algorithm is sketched in Algorithm 6.2. F is the array representing the front as shown in Figure 6.2. For a given l-value, the values F.k[l], F.c[l], and F.sol[l] give respectively the k-value, the count, and whether [k, l] is a solution. It is initialized for l = 2 with k-value 1. The function BestRefinement(F) returns the l-value of the best candidate for refinement (as described above) if it exists; otherwise it either returns the l-value of the solution or reports failure. The function calc(Pos, Neg, k, l) updates F[l] with the appropriate values. Note that two refinements are computed when the selected best candidate has a k-value of 1. As long as there are candidates for refinement (non-solutions) that have a larger bound


than any of the solutions already encountered, BestRefinement will return a non-solution. Hence the algorithm keeps searching for better (larger in the order ≥#S) solutions even though some solutions have already been found.

6.3.2 Learning with Context

The preprocessing step to collect a set of distinguishing contexts, as described in Section 6.1.1, ensures that the context grows with an increase in k or l. As the count of a wrapper decreases with a growing context, the anti-monotonicity property is still valid and our algorithm can easily be extended to learn a wrapper with context.

Example 6.6 Given the document from Figure 2.3, with a positive example that has ‘Stefan’ marked, and a negative example that has ‘Hendrik’ marked. Using no distinguishing contexts, the algorithm reaches a solution for (k, l) = (2, 4), namely the language (in term notation)

L(2,4)({@:N, b(@:N), li(@, b(@:N)), ul(li(@, b(@:N)), li(@, b(@)))}),

while using distinguishing contexts a solution is reached for k = 2 and l = 3:

L(2,3)({@:N, b(@:N), li(‘name:’, b(@:N))}).

Not all data sets need a context. In principle, one could learn the wrapper with context and the wrapper without context independently of each other. However, our system integrates both in one algorithm that maintains two fronts and selects, from both fronts, the most promising point for refinement. Note that, for a given point (k, l), the count of the wrapper with context is bounded by the count of the wrapper without context; i.e., the latter value is used as an extra bound on the count of the former (hence selection is such that the former will only be evaluated when that bound is already known).

6.4 Induction with Equivalence Queries

When two examples contain the same set of marked forks, we can remove one of them from the training set, and the resulting language will stay the same. Arbitrary sets of positive and negative examples often contain redundant information. In many cases, the user needs considerable insight into the learning algorithm to recognize which sets of examples are needed. Otherwise, he has to present a large set of examples in the hope that it contains the few right ones.


We therefore present an interactive system that asks the user for the information it needs to improve its hypothesis. The system uses the algorithms from the previous section and poses equivalence queries (Angluin 1988) to the user. After being given a set of examples, the system returns a hypothesis. The user is asked to indicate whether the hypothesis is correct or otherwise to give a counterexample. Hence only false positives and false negatives are allowed. Clearly, the true positives contain only marked forks that are already encountered, and the true negatives contain marked forks that are already shunned. The system keeps updating its hypothesis, and querying the user, until the user is satisfied with the results. Most practical for the user is to check the already given pages or to try out some new pages of his own choice. When he detects an error, he can signal that error to the system.

In Section 6.4.1 we indicate how to adapt the algorithm of the previous section for efficient interactive use. In Section 6.4.2 we show how a graphical user interface can be used to restrict the user to replying to the system with only false positives and false negatives.

6.4.1 Interactive Algorithm

After each interaction the system updates its hypothesis. This is done by finding the ≥#S-most general language that is consistent with the current set of examples. After each user update, we could restart the algorithm from Section 6.3 and learn from scratch, given the newly expanded set of examples. However, an incremental algorithm is feasible:

• Adding a positive example (a false negative) to the set of examples increases the set of forks, and hence the counts of all wrappers. However, a (k, l)-wrapper that covers negative examples still does so and cannot become a solution. This means that the search for a solution can start from the current front. The initialization of the new search for parameters consists of updating the count fields (F.c) in the front.

• Adding a negative example (a false positive) does not affect the set of forks. However, the solution is invalid as it covers the new negative example. After updating the (true) solution fields1 (F.sol), the search can resume from the current front.

In short, the algorithm from Section 6.3 can be used. When a new example is received, the values in the front are updated and the search resumes.
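The two incremental updates can be sketched as follows (an illustrative fragment of ours; the front is a mapping from l to a (k, count, solution) triple, and recounting and consistency re-checking are abstracted as callables):

```python
def add_positive(front, recount):
    """A false negative enlarges the fork set: refresh the counts, keep the
    front. (When the example comes from a new document, counts change too.)"""
    for l, (k, c, sol) in front.items():
        front[l] = (k, recount(k, l), sol)

def add_negative(front, still_rejects_all):
    """A false positive does not change the forks, but may invalidate
    solutions that cover it: re-check the solution flags."""
    for l, (k, c, sol) in front.items():
        if sol:
            front[l] = (k, c, still_rejects_all(k, l))
```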

Example 6.7 Assume we want to learn the ‘title’ task for the Paper Database example. The user gives an initial example. Let us assume he picks ‘title1’ (the same as in Example 6.4, so that we can refer to the languages learned there). The

1 When the example is from a new document, the counts are also updated.


system learns its first hypothesis and ends up with language [1, 2]. We see that in our example page all title fields are marked, hence there are no false negatives. We could check on some other pages; however, the current page has several false positives: {author1, author2, author3, Prev, 1, 3, Next}. The user chooses one of them. The system (see the description of the implementation below) disallows the user to mark true negatives like ‘search term’ and ‘2’ as negative examples. The algorithm updates its hypothesis. Example 6.4 shows that not every choice is equally informative. Choosing ‘Prev’ or ‘Next’ leads to the solution [1, 4], while all other false positives lead to [1, 3], necessitating an extra iteration. After reaching [1, 4], the user does not find any more counterexamples in the given page. He tries some other pages until he is convinced that the wrapper is indeed correct.

6.4.2 Implementation

We developed an implementation that represents the wrappers by tree automata, enabling extraction in a single run. We added a graphical user interface to our application, which is basically an HTML-compliant browser that allows the user to right-click on an element of the page to add an extra example. The system colors the background of all elements that are extracted by its hypothesis. A click on a colored element is interpreted as a false positive; a click on a plain element is interpreted as a false negative. This way the user is restricted to giving only counterexamples to the equivalence query posed by the system.

Example 6.8 The steps to learn the ‘name’ extraction task for the Student List example are illustrated by screenshots of our application in Figure 6.3. Initially the user has a blank example page from the domain of the extraction task. There is only one possibility: to click one of the extraction targets, thus providing an initial positive example. The learning algorithm can learn from a single positive example, and builds a hypothesis (in our case a (k, l)-contextual language). In the graphical user interface, all elements satisfying this hypothesis are highlighted in green. We see that the values ‘Maurice’ and ‘Hendrik’ are falsely assumed to be students. When the user clicks on ‘Maurice’, this element is given to the incremental algorithm as a false positive (because it was indicated as a positive), and its color is turned to red. In the meantime the algorithm revises its hypothesis, such that it no longer includes ‘Hendrik’.

6.5 Summary

This chapter uses the (k, l)-contextual tree languages from the previous chapter to learn wrappers. We defined wrappers based on correctly marked (k, l)-contextual tree languages, and we have shown how to learn the correct marking acceptors that accept the same languages. This allows us to use the operators and single-run extraction technique from Chapter 4.


Figure 6.3: An example of a training session with the graphical user interface, on the Student List example. Panels: a blank page; a positive example given; the hypothesis reached by the system; indication of a false positive, and the revised hypothesis.


Concerning the parameters k and l, we have proposed some approaches to choose good values. A heuristic that estimates good values is based on the evolution of the number of extractions on unmarked documents from the domain, when traversing the search space defined by the parameters. This approach does not need extra data apart from the original positive examples. Another approach learns the values of the parameters by searching for the values in the parameter space that result in the most general language, learned from the set of positive examples, that rejects all examples from a set of negative examples. We argue that the advantage of an approach that uses negative examples to learn parameters for an induction algorithm that learns from positive examples, over an approach that learns directly from positive and negative examples, is that such an approach needs very few examples. The weaker expressiveness of our subclass does not seem a problem for most extraction tasks, and our algorithm is able to indicate when the expressiveness does not suffice, allowing a switch to a more expressive (but more expensive) approach.

Finally, we introduced an interactive approach, in which the algorithm is able to restrict the additional examples from the user to false positives and false negatives. This ensures that the algorithm gets non-redundant examples, such that the wrapper can be improved in each iteration. We presented an implementation of this interactive scheme, with a graphical user interface.


Chapter 7

Related Work and Experimental Comparison

The aim of this chapter is not to provide an in-depth survey of the field of information extraction from web pages; for such a survey we gladly refer to (Kosala 2003). We cite some important related work, and discuss in a bit more detail those state-of-the-art approaches to which we compared our system experimentally.

We deal with string-based methods in Section 7.1 and continue with tree-based methods in Section 7.2. The results of an experimental study are presented in Section 7.3.

7.1 String Based Methods

Some string-based methods (Chidlovskii et al. 2000) are node-based, i.e., they extract whole text nodes, but the majority can extract a substring from a text node. Instead of finding a single node, they return the boundaries of the substring. Except for WIEN (Kushmerick et al. 1997), which is character-based, all systems are token-based. This means that the start and end boundaries always lie between two tokens. The target values will usually not contain half a token, and the higher granularity speeds up the learning phase. The use of wildcards for different classes of tokens also drastically reduces the number of examples needed for learning. Some systems were designed for information extraction from free text: BWI (Freitag and Kushmerick 2000), HMM (Freitag and McCallum 1999), SRV (Freitag 1998), and RAPIER (Califf and Mooney 1999), but these are general enough to apply to semi-structured data as well. WHISK (Soderland 1999) is designed for both free text and semi-structured data. We have chosen two of the more performant systems (BWI and STALKER (Muslea et al. 2001)) to compare experimentally with our approach. Experiments in (Freitag and Kushmerick 2000)


show that BWI outperforms HMM, SRV, and RAPIER (the last one only tested on free text). It is argued in (Muslea et al. 2001) that the rules used in WIEN, SoftMealy (Hsu and Dung 1998), SRV (Freitag 1998), and RAPIER (Califf and Mooney 1999) are strictly less expressive than STALKER’s. In the same paper it is shown experimentally that STALKER outperforms WIEN. Below we describe the STALKER and BWI systems in a bit more depth.

7.1.1 STALKER

To extract a subsequence from a sequence of tokens, the STALKER system uses a start and an end rule to find the boundaries of that subsequence. The start rules are executed in the forward direction from the beginning of the sequence; the end rules are executed in the backward direction. A STALKER rule is either a simple rule or a disjunction of simple rules. In the latter case the boundary is given by the first simple rule that does not fail. The simple rules are based on a list of so-called landmarks. A landmark is a sequence pattern consisting of tokens and/or wildcards. When a rule is executed, it searches for a part of the sequence that matches the first landmark. From the end of this part the search for the second landmark is started, and so on. The boundary that is finally returned is either the end or the beginning of the part that matched the last landmark; which one is indicated by a modifier. This is respectively SkipTo and SkipUntil for using the end or the beginning (or BackTo and BackUntil for rules in the other direction). When the search for a landmark reaches the end/beginning of the sequence, the rule is said to fail. STALKER uses multiple types of wildcards that form a type hierarchy. This hierarchy is shown in Figure 7.1.

Example 7.1 The rule SkipTo(<p><b><a>) applied on the HTML sequence of Figure 2.4 returns the position at the end of the first occurrence of these three consecutive tags, i.e., at the beginning of ‘title1’, while the rule BackTo(<center>) BackTo(</a>) applied on the same sequence returns the position at the end of ‘author3’. These rules will both fail on the HTML sequence of Figure 2.3. The rule SkipTo(name Punctuation) SkipUntil(Capitalized), with two landmarks each containing a wildcard, will, given the sequence of Figure 2.3, return the beginning of ‘Stefan’.
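To make the landmark semantics concrete, here is a small, simplified matcher for SkipTo-style rules over a token sequence (our own illustration of the published rule semantics, not STALKER code; wildcards are reduced to plain predicates, and SkipUntil/BackTo are omitted):

```python
def skip_to(tokens, start, landmarks):
    """Apply consecutive SkipTo landmarks; return the position just after
    the last matched landmark, or None if any landmark search fails.
    Each landmark is a list of predicates, one per token."""
    pos = start
    for landmark in landmarks:
        found = None
        for i in range(pos, len(tokens) - len(landmark) + 1):
            if all(p(tokens[i + j]) for j, p in enumerate(landmark)):
                found = i + len(landmark)  # SkipTo: end of the matched part
                break
        if found is None:
            return None  # the rule fails
        pos = found
    return pos

def lit(s):
    """Landmark element matching a literal token."""
    return lambda tok: tok == s

def punctuation(tok):
    """A simple Punctuation wildcard."""
    return bool(tok) and all(not ch.isalnum() for ch in tok)
```

For example, on the token sequence <p> <b> <a> title1 </a>, the rule SkipTo(<p><b><a>) stops just before ‘title1’.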

The STALKER induction algorithm starts from a set of positive examples (each consisting of a sequence wherein the boundaries of a subsequence are given). As long as this set is not empty, a new simple rule is learned; those examples covered by this rule are removed from the set, and the rule is added to the disjunction of rules that will be the final result. The algorithm to learn a simple rule chooses one seed example (the shortest example in the set) to guide the induction; the other examples are used to test the quality of candidate rules. The algorithm does not search the entire rule space for the best rule. In each loop it takes two rules from


AnyToken
    Html
    non-Html
        AlphaNumeric
            Alphabetic
                Capitalized
                AllCaps
            Number
        Punctuation

Figure 7.1: Wildcard hierarchy. A token that matches a wildcard of a given type will also match the wildcards of the ancestors of that type.
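The ancestor-matching semantics of the hierarchy can be captured with a simple parent map (a sketch of ours with the type names of Figure 7.1, not STALKER’s implementation):

```python
# Child -> parent in the wildcard hierarchy of Figure 7.1.
PARENT = {
    "Capitalized": "Alphabetic", "AllCaps": "Alphabetic",
    "Alphabetic": "AlphaNumeric", "Number": "AlphaNumeric",
    "AlphaNumeric": "non-Html", "Punctuation": "non-Html",
    "non-Html": "AnyToken", "Html": "AnyToken",
}

def matched_wildcards(wildcard):
    """All wildcards matched by a token of the given (most specific) type:
    the type itself plus all of its ancestors, up to AnyToken."""
    types = [wildcard]
    while types[-1] in PARENT:
        types.append(PARENT[types[-1]])
    return types
```

A Capitalized token, for instance, also matches Alphabetic, AlphaNumeric, non-Html, and AnyToken.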

a given set of rules: one is the best solution in that set, the other is the best refiner in that set. Some heuristic rules are designed to define a ranking (best solution and best refiner) over a set of rules. This ranking is based on properties of the rules, and on the number and quality of the extractions of each rule on the other examples. The refinements of the best refiner, together with the best solution, give the new rule set for the next iteration. This loop continues until a perfect solution is found (one that either extracts correctly from an example or fails on that example) or until all refinements fail. The initial set of candidate rules consists of single-landmark rules, with each landmark a single token or wildcard (occurring in the seed). The refinement step will either extend one of the landmarks of a rule with an extra token or wildcard (the extended landmark has to match within the seed), or add a new single token/wildcard landmark somewhere in the rule (the token or wildcard has to occur in the seed).

In contrast with other string based methods, STALKER implements a hierarchical extraction approach. An Embedded Catalog (EC) describes the structure of the data. This is a tree structure where the leaves are fields, and the internal nodes are either tuples or lists. Figure 7.2 shows the EC for the Paper Database example, the Student List example, and the Restaurant Guide example. Note that the EC formalism might not be expressive enough to represent some more complex data structures. To extract a specific field, first the parent has to be extracted; the extraction rules are then applied on the subsequence extracted for the parent. To extract the author fields for the Paper Database example, first the complete list of papers is extracted (the rules from Example 7.1 achieve this). Then the individual papers are extracted. Finally, from each paper, the author field is extracted. The advantage of this approach is that complex extraction tasks are split into easier problems. Disadvantages are that during learning more examples are needed, to learn for every level of the hierarchy¹, and that errors in the different levels accumulate.

¹ To learn list extraction, each example should consist of two consecutive elements of the list.


Related Work and Experimental Comparison

  PD Document
    SearchTerm
    LIST(Paper)
      Title
      Author

  SL Document
    LIST(Student)
      Name
      Supervisor

  RG Document
    SearchTerm
    LIST(Restaurant)
      Name
      Type
      City
      Phone

Figure 7.2: Embedded Catalogs for the Paper Database example (first), the Student List example (second), and the Restaurant Guide example (third).

Example 7.2 Having a look at the Restaurant Guide example, we see that to extract the fields that are the lowest in the embedded catalog, we have to extract the parent first. For the ‘Name’, ‘Type’, ‘City’, and ‘Phone’ fields we first have to extract the ‘Restaurant’ subsequences. We consider the first subsequence extracted for the tuple ‘Restaurant’:

New <i> China </i> Town ( chinese ) </a> Brussels <b>

Tel : + 32 </b> ( 0 ) 2 345 67 89

The rule SkipTo(</a>) applied on this sequence returns the position at the end of the first (and, in this example, only) occurrence of the tag ‘</a>’, hence at the beginning of ‘Brussels’. The rule BackTo(<b>) goes backward and returns the position at the end of ‘Brussels’. Hence these rules are a start and an end rule for the ‘City’ field.

For the rule SkipUntil(AnyToken) we see that the first token of the restaurant sequence is matched by the wildcard ‘AnyToken’. As the modifier is ‘until’, the beginning of that token is returned. This is the beginning of ‘New’ for the above sequence. The rule BackTo(</a>) BackTo(‘(’) goes backward to the position before the first matching ‘</a>’ token, and then continues going backward from there until the first ‘(’ encountered. The position between ‘Town’ and ‘(’ will be returned. Therefore these rules can be used to extract the ‘Name’ field. Note that this is a sub node field extraction, because the end boundary is in the middle of a text node.

One level higher, to extract the subsequences for the tuple ‘Restaurant’ from the sequence extracted for the list of restaurants, we use the start rule and end rule repeatedly. The first start boundary, though, coincides with the start boundary of the list (and the last end boundary coincides with the end boundary of the list). The start rule SkipTo(<p><a>) returns the position at the end of the first occurrence of these two consecutive tags. The end rule for this extraction task is BackTo(</p>).

The top level extraction task in the hierarchy is to extract the sequence containing the list of restaurants from the whole document.
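The landmark semantics used in Example 7.2 can be sketched on a flat token list. This is a simplification, assuming single-token landmarks only and a whitespace tokenisation that treats tags as single tokens:

```python
def skip_to(tokens, landmarks, start=0):
    """SkipTo(l1) ... SkipTo(ln): scan forward, and after each landmark
    match continue from there; return the index just after the last match.
    (No bounds checking: a missing landmark raises IndexError.)"""
    pos = start
    for lm in landmarks:
        while tokens[pos] != lm:
            pos += 1
        pos += 1
    return pos

def back_to(tokens, landmarks, end=None):
    """BackTo(l1) ... BackTo(ln): scan backward from the end; return the
    index of the last landmark matched, i.e. the boundary just before it."""
    pos = len(tokens) if end is None else end
    for lm in landmarks:
        pos -= 1
        while tokens[pos] != lm:
            pos -= 1
    return pos
```

On the restaurant subsequence above, SkipTo(</a>) and BackTo(<b>) then delimit exactly the token ‘Brussels’, and BackTo(</a>) BackTo(‘(’) lands between ‘Town’ and ‘(’.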


7.1.1.1 STALKER with Co-Testing

Co-Testing is an active learning approach in which multiple views are defined on the data. A hypothesis is learned in each of the views. For a set of unseen, unmarked data, contention points are defined as the examples on which the hypotheses disagree. A query about a contention point will improve at least one of the hypotheses.
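In code, contention points are simply the examples on which the two views disagree. This is a sketch; the views here are arbitrary functions from an example to a predicted value:

```python
def contention_points(unlabeled, view_a, view_b):
    """Examples on which the two hypotheses predict differently.
    Querying the true label of any of them corrects at least one view."""
    return [x for x in unlabeled if view_a(x) != view_b(x)]
```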

Applied to STALKER, two views are used: each boundary can be described by a forward rule or by a backward rule. The Naive Co-Testing approach picks a random contention point, while Aggressive Co-Testing tries to pick a contention point that is likely to be wrong in both views, such that both hypotheses can be improved. To order the contention points, patterns are learned on the content of the example fields. Contention points that differ most from these patterns in both views are chosen first.

7.1.2 BWI

Like STALKER, Boosted Wrapper Induction (BWI) extracts a subsequence using a start rule and an end rule. A BWI rule is a set of simple rules, each with an associated weight. During extraction, each simple rule in the set extracts a boundary and casts a weighted vote; a single winning boundary is returned. Using the terminology from STALKER, a single BWI rule consists of two landmarks, called prefix and suffix. The rule searches for the first sequence that matches the concatenation of prefix and suffix. The boundary point is placed in between the tokens matching the prefix and those matching the suffix. This is less expressive than a simple STALKER rule. In BWI, both start rules and end rules go in the forward direction. The BWI system does not extract hierarchically. The rules are applied to find every matching point in the entire HTML sequence.

Example 7.3 The BWI rule 〈 [<p><b><a>], [non-Html] 〉 looks for the first occurrence of the three HTML tags followed by a non-Html token. Applied on the HTML sequence from Figure 2.4, it returns the start position of ‘title1’.
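A single boundary detector and the weighted vote can be sketched as follows (a simplification with literal tokens only, no wildcards; the function names are ours):

```python
def prefix_suffix_rule(prefix, suffix):
    """A simple BWI-style rule: find the first match of prefix + suffix
    and place the boundary between them (None if there is no match)."""
    def rule(tokens):
        n, m = len(prefix), len(suffix)
        for i in range(len(tokens) - n - m + 1):
            if tokens[i:i + n] == prefix and tokens[i + n:i + n + m] == suffix:
                return i + n
        return None
    return rule

def bwi_vote(tokens, weighted_rules):
    """Each simple rule proposes a boundary; the boundary with the
    highest total weight wins."""
    scores = {}
    for rule, weight in weighted_rules:
        boundary = rule(tokens)
        if boundary is not None:
            scores[boundary] = scores.get(boundary, 0.0) + weight
    return max(scores, key=scores.get) if scores else None
```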

The BWI learning algorithm uses a boosting approach. It learns repeatedly from a set of weighted examples. In each iteration, a simple rule is learned with a weak learner. After each iteration, the weights of the examples are changed according to the performance of the learned rule. Examples that are extracted well get a smaller weight, while for the others the weight is increased. Hence in each iteration, the weak learner focuses on the examples for which the results are poor. This technique is shown to give significant improvements over the use of the weak learner on its own (Schapire and Singer 1999). The number of boosting iterations is given as a parameter T.
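The reweighting step can be sketched with an AdaBoost-style update. This illustrates the general boosting scheme, not BWI's exact formula:

```python
import math

def boost_weights(weights, correct, epsilon=1e-6):
    """AdaBoost-style reweighting: down-weight the examples the current
    rule handles correctly, up-weight the rest, and renormalize."""
    err = sum(w for w, c in zip(weights, correct) if not c) / sum(weights)
    err = min(max(err, epsilon), 1 - epsilon)       # avoid division by zero
    alpha = 0.5 * math.log((1 - err) / err)         # the rule's vote weight
    new = [w * math.exp(-alpha if c else alpha) for w, c in zip(weights, correct)]
    z = sum(new)
    return [w / z for w in new], alpha
```

Starting from four equal weights with one misclassified example, that example ends up carrying half of the total weight, so the next weak learner concentrates on it.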

The weak learner used in BWI starts from an empty pair 〈 [], [] 〉. The algorithm searches for the best possible extension (front) of the prefix and the best possible extension (back) of the suffix. A lookahead parameter L indicates the maximal length of the extensions. Every combination of the old and new prefix with the old and new suffix is given a score with a function that measures performance on the training set. The combination with the best score becomes the new rule. This process is repeated until the rule remains unchanged. The algorithm can also use extra negative examples next to the positive ones.

It is not straightforward to compare the expressiveness of the (k, l)-contextual tree languages with the wrapper representation languages of STALKER or BWI. Therefore, an experimental comparison is performed in Section 7.3. Note, though, that even when a correct wrapper can be expressed in the representation language, the heuristic search in STALKER is not guaranteed to find it. For our induction algorithm it holds that if a correct wrapper can be expressed as a (k, l)-contextual tree language, it will be found (given sufficient examples).

7.2 Tree Based Methods

In (Kosala et al. 2003), it is shown that wrappers learned directly from the unranked tree structure of the document perform better than wrappers learned from ranked representations of the unranked document tree. The tRPNI algorithm (Carme et al. 2004) also learns from unranked trees. Its hypothesis space consists of the whole class of regular unranked tree languages. However, it needs completely annotated documents, which the authors show to be equivalent to learning from positive and negative examples. In more recent work, a derived algorithm, RPNIprune, is presented that learns from incompletely annotated data (still positive and negative examples). This algorithm is incorporated in the SQUIRREL system (Carme et al. 2007). We describe the SQUIRREL system in more detail below, as well as the Local Unranked Tree Inference (LUTI) algorithm (Kosala et al. 2003), a learning approach that uses positive examples only.

Besides approaches that learn a wrapper, there is also a research direction that explores wrapper programming languages and the visual specification of wrappers; see (Gottlob and Koch 2004) for a representative example.

7.2.1 The Local Unranked Tree Inference Algorithm

The Local Unranked Tree Inference algorithm is closely related to our approach. As mentioned in Section 6.1.1, both methods start with a preprocessing step to generalize over the text nodes in the training examples. Each text node becomes either a target (X), a distinguished context (C), or a generalized text node (@). Note that in contrast to our approach, this approach uses only a single string as distinguishing context instead of a set. Basically, this method infers a (k, 2)-contextual language (the special case l = 2 of our method). But some extra differences exist.


  html.X
    body.X
      ul.X
        li.X
          C
          b.X
            X
        li
          @
          b
            @
      ul.C
        li.C
          C
          b
            @
        li
          @
          b
            @

  html.X
    body.X
      ul.X
        li.X
          C
          b.X
            X
        li
      ul.C

Figure 7.3: An illustration of the first and the second transformation, which augment the expressiveness of LUTI. The first tree shows the result of the first transformation; the second tree shows the result of applying the second transformation as well.

We can describe these as two transformations performed in a preprocessing step on the training examples as well as on the documents to be extracted.

The first transformation relabels every node f as f.X if its subtree contains the X-node. If the subtree does not contain the X-node but does contain a C-node, then the node is relabeled f.C. Hence, limited information is passed infinitely far upwards, making the method not purely local. However, the subclass remains inferable and the expressiveness is enhanced.

The second transformation in (Kosala et al. 2003), although part of the inference algorithm, can also be explained as a preprocessing step. The automaton accepts everything below a node that is not of the form f.X, i.e., all subtrees below such nodes can be removed; only the path from the root to the X-node is left, together with the siblings of the nodes on that path. Parts farther away from the marked node are ignored. This enhances the generalizing power of the resulting language (and reduces the expressiveness).
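Both transformations can be sketched on trees represented as (label, children) pairs. This is our own minimal encoding for illustration; the actual algorithm works on tree automata:

```python
def mark(tree):
    """First transformation: relabel f as f.X (resp. f.C) if its subtree
    contains the X node (resp. a C node but no X node).
    Returns (relabeled tree, marker propagated upward)."""
    label, children = tree
    if not children:                    # leaf: 'X', 'C' or '@'
        return (label, []), (label if label in ('X', 'C') else '')
    marked, found = [], ''
    for child in children:
        mchild, tag = mark(child)
        marked.append(mchild)
        if tag == 'X':
            found = 'X'
        elif tag == 'C' and found != 'X':
            found = 'C'
    new_label = label + '.' + found if found else label
    return (new_label, marked), found

def prune(tree):
    """Second transformation: below any node not marked '.X', drop all
    children; only the path to X and the siblings along it survive."""
    label, children = tree
    if not label.endswith('.X'):
        return (label, [])
    return (label, [prune(c) for c in children])
```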

Example 7.4 The first tree in Figure 7.3 shows the document tree of Figure 2.3 (Student List example) after the first transformation, while the second tree shows the result of applying the second transformation to that same tree. Note that the string “name:” is used as the distinguishing context and is replaced by C.

Thanks to the first transformation, the LUTI algorithm can express some global vertical relations. The relation between the target node or a context node and an ancestor an arbitrary number of levels higher can be described, even though l is always 2. Our algorithm (KL) is purely local and does not have this extra expressiveness. Our experiments showed that local information in the vertical direction (a high enough l parameter) was sufficient for all data sets.

The second transformation in LUTI reduces its expressiveness with regard to KL, as all information about the siblings of the target node is removed, while KL retains this neighborhood. We encountered several data sets where that information was needed to disambiguate positive and negative examples. Consider, for instance, a table with bargains. The aim is to extract those with a picture of the item. The picture, when present, occupies the first cell of the row (a sibling of the cell containing the target).

7.2.2 SQUIRREL

The algorithms tRPNI (Carme et al. 2004) and RPNIprune (Carme et al. 2007) use a special kind of automata: Node Selecting Tree Transducers (NSTTs), which allow trees to be annotated (extract or do not extract). The RPNIprune algorithm uses the same generalization step as tRPNI: the merging of two states. But in RPNIprune not every merge is allowed, so that the algorithm effectively searches in a subclass of the regular tree languages. This subclass is constructed such that the algorithm can learn from incompletely annotated examples; an extra side effect of the use of a subclass is that fewer examples are needed for learning. We go into a bit more detail below.

A pruned tree is a tree in which some of the subtrees are replaced by a special symbol T, and pruning NSTTs (pNSTTs) are automata that annotate pruned trees. A pruning heuristic defines a way to generate pruned trees from regular trees, where a single tree can result in multiple possible pruned trees. The incompletely annotated examples are preprocessed: they are pruned according to the pruning heuristic, so that the resulting pruned trees are completely annotated. The RPNIprune algorithm starts from these completely annotated examples and restricts the search to cut-functional automata. An automaton is cut-functional when all corresponding nodes in pruned trees derived from the same tree are annotated consistently. The subclass of the regular tree languages in which the search takes place is therefore defined by the cut-functional automata, which are derived from a set of trees pruned according to a given pruning heuristic. The pruning heuristic is a parameter of the RPNIprune algorithm. This parameter influences the expressiveness and the number of examples needed. An extreme case is the identity function, which makes the RPNIprune algorithm equivalent to tRPNI (the expressiveness of the complete class of regular tree languages, but needing completely annotated examples).
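As an illustration, one conceivable pruning heuristic (not necessarily the one used by SQUIRREL) replaces every subtree rooted below a fixed depth with the special symbol T:

```python
def prune_below_depth(tree, depth):
    """Replace every non-leaf subtree rooted deeper than `depth` with the
    special symbol 'T'; trees are (label, children) pairs."""
    label, children = tree
    if depth == 0 and children:
        return ('T', [])
    return (label, [prune_below_depth(c, depth - 1) for c in children])
```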

The SQUIRREL system uses the RPNIprune algorithm with a graphical user interface very similar to ours (Section 6.4.2). In their active learning model, two types of queries are defined: correct labeling queries (CLQ) and query equivalence queries (QEQ). A CLQ is a request to the user to specify a false positive or a false negative in the current page. A QEQ asks the user either to confirm that the hypothesis is correct (as far as the user can judge), or to provide a new page on which the hypothesis still fails. In our system, the user is asked to specify a false positive or a false negative from whatever page he chooses, and to indicate when he is satisfied with the latest hypothesis. These two are basically the same, given that a negative response on a QEQ has to be followed by a CLQ, except that SQUIRREL places an extra restriction, in that all errors from a given page have to be resolved before a new page can be seen.

7.3 Experiments

The experimental comparisons in this section are split into two parts. First we compare our induction algorithm from Section 6.1, together with the parameter estimation heuristic from Section 6.2, to other wrapper induction approaches that learn from positive examples only (LUTI, STALKER, and BWI). Secondly, we compare our interactive approach using equivalence queries from Section 6.4 to other active learning systems (STALKER with Aggressive Co-Testing, and SQUIRREL).

For the purpose of these experiments we obtained an implementation of the BWI algorithm from the Fondazione Bruno Kessler. For the STALKER and Local Unranked Tree Inference algorithms we used our own implementations. Through extensive communication with the authors we have tried to stay as close as possible to the original implementations. For both STALKER and BWI we used the same set of wildcards, the one shown in Figure 7.1.

7.3.1 Positive Examples Only Approaches

We aim to compare the ability of the different ‘positive examples only’ algorithms to learn from a small set of positive examples. In our setup, each experiment selects 5 random examples from a data set. Each algorithm learns from the same 5 examples, and the F1 score of the resulting wrapper on the whole data set is calculated. This experiment is not intended to measure the number of examples needed by each algorithm, but to measure which algorithm learns best from a given sample of (incomplete) data. Table 7.1 shows for each task the mean over 5 experiments. Note that due to the hierarchical nature of STALKER, it is given more information per example: not only the boundaries of the target, but also the boundaries of its ancestors and, when one of the ancestors is a list element, also the boundaries of one of its adjacent siblings. It also expects the embedded catalog for the induction task at hand. For the WIEN data sets, we use the embedded catalogs originally used in the STALKER papers. We did not run STALKER on bigbook and okra due to a lack of embedded catalogs for them. We did not run BWI on them either, as it was not worth the effort of making an extra converter.
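The F1 score used throughout is the harmonic mean of precision and recall over the extracted versus the annotated target elements:

```python
def f1_score(extracted, targets):
    """F1 = 2PR/(P+R) on sets of extracted and annotated target elements."""
    extracted, targets = set(extracted), set(targets)
    tp = len(extracted & targets)          # true positives
    if tp == 0:
        return 0.0
    precision = tp / len(extracted)
    recall = tp / len(targets)
    return 2 * precision * recall / (precision + recall)
```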

The STALKER algorithm is parameterless. The (k, l)-contextual algorithm presented here needs three parameters: k, l, and whether to use distinguishing contexts or not. These parameters need tuning for every task. In column KL (opt.) in Table 7.1, the results are shown for the set of optimal parameters for each experiment (based on F1 score on the test set). To have a fair comparison, though, the parameters have to be chosen without the extra annotations in the test set. We therefore use our parameter estimation heuristic, based on unmarked pages.


Data set    ctx   LUTI    KL (opt.)  KL (est.)  STALKER   BWI
s1-1               89.1     100.0      100.0      92.2    100.0
s1-3               90.4      98.7       93.4      81.0      9.1
s1-4               78.8     100.0      100.0      93.1     42.7
s3-2               97.6     100.0      100.0      99.4     27.4
s3-3               98.2     100.0      100.0      96.3      6.4
s4-1               91.6     100.0      100.0      88.8     58.9
s5-2               93.8      98.9       94.7      91.3     27.6
s8-2              100.0     100.0      100.0      95.9     31.6
s8-3              100.0     100.0      100.0      91.3     96.6
s10-2             100.0     100.0      100.0      96.8     20.3
s10-4             100.0     100.0      100.0      96.3     10.8
s11-1        ✔    100.0     100.0      100.0      91.7      1.9
s11-2        ✔    100.0     100.0       89.4                8.1
s12-2              98.4      98.5       98.4      93.1     33.8
s13-2             100.0     100.0      100.0      95.9     38.7
s13-4             100.0     100.0      100.0      85.6      5.5
s14-3              99.5     100.0      100.0      78.0     14.1
s15-2              97.1     100.0      100.0      96.2     34.9
s19-4             100.0     100.0      100.0     100.0     17.2
s20-3        ✔     98.5     100.0      100.0     100.0     97.7
s20-4        ✔     97.5     100.0      100.0      99.6    100.0
s20-5        ✔     97.5     100.0      100.0     100.0     86.4
s20-6        ✔     98.5     100.0      100.0      84.0    100.0
s22-2              93.3     100.0       99.8     100.0     68.9
s23-1              97.6     100.0      100.0      87.5     99.5
s23-3              94.4     100.0      100.0      96.2     19.7
s25-2              97.2     100.0      100.0      93.5     20.9
s29-1              96.6      96.6       65.6      87.3     22.8
s29-2             100.0      87.8       36.8      60.7     28.4
s30-2              96.0     100.0       96.0      88.0     88.6
bigbook-2          94.3     100.0       97.3
bigbook-3          88.0     100.0       96.9
okra-1       ✔    100.0     100.0      100.0
okra-2       ✔     99.3     100.0      100.0
okra-3       ✔     99.1     100.0      100.0
okra-4       ✔     99.1     100.0      100.0

Table 7.1: Experimental comparison of how well the given algorithms perform with few examples. Each column shows the F1 score for the wrappers learned with the respective algorithm from a set of only 5 random examples (each experiment is performed 5 times and the results are averaged). The first column indicates the data set; the second column indicates whether the algorithms LUTI, KL (opt.), and KL (est.) used distinguishing contexts.


As unmarked pages we use the pages from the given examples with the markers removed. The results are given in column KL (est.). Note that the estimated parameters are sometimes suboptimal.

The BWI system has two parameters for the learning phase, the lookahead (L) and the number of boosting iterations (T), and one parameter (τ) for the extraction phase that allows a trade-off between precision (τ = 1) and recall (τ = 0). For T we used the values 10 and 100, but found them to make no difference. The explanation in (Freitag and Kushmerick 2000) seems valid: in contrast to free text, semi-structured documents have a very regular format, and therefore need only a few boundary detectors. For τ we used the values 0, 0.5, and 1, but found them to have very little influence. The training time increases exponentially with the lookahead; L is therefore a trade-off between quality and time. We used all values from 2 to 7. The LUTI algorithm also has two parameters: k and whether to use distinguishing contexts or not. The results shown for both LUTI and BWI are those with the set of optimal parameters for each experiment (based on F1 score on the test set). The context parameter for LUTI and KL (both opt. and est.) was always the same, and is given in an extra column (ctx). Note that no results for STALKER are included for data set s11-2, as this task contains alternative values, i.e. multiple elements to be extracted for the same field within the sequence extracted for the parent tuple. This cannot be represented in the embedded catalog formalism of STALKER.

Even though the KL algorithm with estimated parameters does not always reach optimal scores, it is beaten in only 4 of the 36 tasks: three times by STALKER and three times by LUTI (in two of these tasks by both). Moreover, it performs better than LUTI on 22 tasks (out of 36) and better than STALKER on 23 tasks (out of 29). Note that using optimal parameters for LUTI and BWI is not fair when comparing them with STALKER, but it makes the results for KL (est.) stronger. Recall that STALKER also has the advantage of getting extra information.

7.3.2 Interactive Approaches

In an interactive approach, we expect the user to continue until perfection is reached. We are therefore interested in a comparison of the number of interactions needed by the different approaches to reach a 100% F1-score.

For our interactive approach, the setup of the experiment is as follows. Initially a single random example is given to the algorithm. In every iteration, the algorithm learns a new hypothesis. The user input is simulated by taking a random element from the set of false positives and false negatives. The algorithm stops when a 100% F1-score is obtained (no more false positives or false negatives). We use the same tasks as in the experiment of Section 7.3.1. Each task is performed 30 times with random examples, and the results are averaged. In Table 7.2 we show the number of interactions needed to learn the wrapper. The first column of the table contains the data sets. For the interactive (k, l)-contextual algorithm we include a column to indicate the number of positive and negative examples² needed (averaged), columns to show the learned k and l (these were the same for all 30 runs), and a last column to indicate the total time needed by all the learning steps in Algorithm 6.2, also averaged over the 30 runs.
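The simulated session can be sketched as follows. Here `learn` is a placeholder for the induction step, and plain items stand in for document nodes; this illustrates the experimental protocol, not the actual implementation:

```python
import random

def interactive_session(targets, learn, rng=None):
    """Simulate the experiment: one random positive seed, then one random
    counterexample (false positive or false negative) per iteration, until
    the induced wrapper reaches a 100% F1 score. Returns the labeled
    examples and the number of counterexamples queried."""
    rng = rng or random.Random(0)
    examples = {rng.choice(sorted(targets)): True}
    queries = 0
    while True:
        extracted = learn(examples)
        errors = (extracted - targets) | (targets - extracted)
        if not errors:                       # 100% F1: stop
            return examples, queries
        item = rng.choice(sorted(errors))
        examples[item] = item in targets     # the simulated user labels it
        queries += 1
```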

For data set s29-2 the timings for our algorithm are exceptionally large (often more than 30 minutes). We therefore used the following alteration to the basic algorithm: if a solution has already been found and the algorithm takes longer than 5 seconds, interrupt the algorithm and return the best solution so far. Hence for set s29-2 we show only the final parameters and the number of interactions, but no timings. An explanation for this behavior is the following: even though a local solution is found for s29-2, the algorithm keeps trying higher l-values to make sure no better solution is skipped (as it does for all data sets). The nodes in the local context of a field typically have modest branching factors (also for s29-2), hence the local solution is reached in time. When the parameter l becomes larger, though, the forks escape the local context around the target and might include nodes with higher branching factors, leading to more complex tree automata. This is the cause of the longer timings on s29-2.

For the Co-Testing approach, we use the same setup as in (Muslea et al. 2003). There, a wrapper is learned to extract only the target field from the sequences extracted for its parent in the Embedded Catalog; no wrappers are learned for the other extraction tasks in the hierarchy. We restricted ourselves to this setup, as the application of Co-Testing to list extraction is not detailed in the paper, and the induction of the extraction tasks at the top of the hierarchy becomes impractically slow. In our experiment, each induction task starts with two random examples. As long as contention points exist, the Co-Testing approach asks for the correct solution for one of the parent sequences (one containing the chosen contention point). Each induction task is performed 30 times, and the results are averaged. These results are also shown in Table 7.2. Column ‘P’ shows the average over all runs of the number of positive examples (the two random initial examples, plus the queries by the algorithm). The last column holds the average time in milliseconds for the total of all induction steps in a single run. Again, these induction times are on subsequences of the document, while for the interactive KL they are on the whole document.

Note that for data set s13-4, we stopped the Co-Testing algorithm after 100 queries, hence no 100% F1 score was reached. For data set s29-2, the Co-Testing algorithm did stop, but again no 100% F1 score was reached. This implies that in this data set both views made the same mistake, such that no extra contention points could be found.

For the Co-Testing approach, an interaction consists of a query about a single contention point. For our approach and for SQUIRREL, an interaction is a query to give a counterexample from a page (or multiple pages). Given a page on which the elements extracted by the wrapper associated with the given hypothesis are marked, a user has to check whether all target fields are marked, and whether all marked elements are indeed target fields. This can amount to a substantial number of checks. With the graphical representation, given that the layout of the target fields is mostly regular, an uncolored target field, or a colored element away from the target fields, really sticks out. The human pattern recognition ability is able to spot these anomalies at a glance (or with a quick scroll for larger pages). In our experimental evaluation we will therefore count only the counterexamples given by the user as interactions, ignoring the many checks this might imply.

² P/N = 1/0 means that the initial (1, 2)-wrapper derived from only one positive example is a solution.

Hence, although the types of interaction are different (an equivalence query as opposed to a single query on a subsequence), we believe that the work for the user is almost the same. We therefore feel justified in pointing out that our proposed algorithm needs significantly fewer user interactions than Co-Testing. The timings also show that the system is highly responsive and suited for interactive use.

We did not run similar experiments with the more recent SQUIRREL system. Given the reported results of their experiments in (Carme et al. 2007), we can draw some conclusions. Their wrappers are also able to reach a 100% F1-score on the WIEN data set. Some problems seem to arise for data sets where there is a need for context strings. But as they state, this can be solved using a preprocessing step similar to ours, combined with a different pruning heuristic (although it would be nice if the system could do this transparently for the user).

Concerning the number of interactions needed, we compare the number of counterexamples (P+N) needed by our system with the number of counterexamples needed by SQUIRREL (#CLQ). Unfortunately, there are only two data sets on which these interactions are counted for both systems. On these two data sets, our system needed fewer interactions: Okra-names 2 versus 3.48, and Bigbook-addresses 2.3 versus 3.02, but there are not enough data sets to draw conclusions from.

7.4 Summary

In this chapter we have examined in detail some existing state of the art string approaches to wrapper induction (STALKER and BWI), and some existing state of the art tree approaches (SQUIRREL and an approach based on Local Unranked Tree Inference).

We performed two sets of experiments. The first compares the performance of the different approaches that learn from positive examples only, learning from a small set of examples. These experiments clearly show that our approach is superior to the state of the art approaches involved. The second set of experiments compares interactive approaches. Each approach has to learn until a perfect wrapper is achieved, and the number of examples is counted, along with the time needed for


Data set    Interactive KL                  Co-Testing
            P/N       k   l      ms        P        ms
s1-1        1/1       1   3      21        2.5        35
s1-3        4/1.7     3   3     642        7.7      1595
s1-4        1/0       1   2       5        8.7      7376
s3-2        1/1       1   3      12        2.5       190
s3-3        1/0       1   2       4        2.5        77
s4-1        1/0       1   2       2       30.8    266528
s5-2        1/1       1   4      19        6.9       405
s8-2        1/1       1   3      12        2.3         9
s8-3        1/1.3     2   3      43        3.4       554
s10-2       1/1       1   3       9        3.0        34
s10-4       1/1       2   2      15        8.8     35072
s11-1       1/2.1     2   4    4191       80.3      4220
s11-2       1/1.6     2   4     732
s12-2       2/1.4     1   4      43        8.8       515
s13-2       1/1       1   3      10        2.6        23
s13-4       1/1       2   2      15      100+    1387720
s14-3       1/0       1   2       3        5.6       441
s15-2       1/0       1   2       2       18.3      3912
s19-4       1/1       1   3       7        2.1         8
s20-3       1/0       1   2       3        2.8        29
s20-4       1/1.3     2   3     198        3.1        51
s20-5       1/2       2   3    1142        3.0        53
s20-6       1/1.3     2   3      44        3.1      3024
s22-2       1/1       1   4      26        3.77       88
s23-1       1/1       2   3      39        3.6       132
s23-3       1/1       1   3      12        8.8       382
s25-2       1/1       1   3       6        7.6      1246
s29-1       2/1.6     2   3     125       10.4    179879
s29-2       4.8/2.9   4   3               15.9    114784
s30-2       2/1       1   3      12        2.5         5
bigbook-2   2/2       2   5     574
bigbook-3   1/1.3     2   4     100
okra-1      1/1.6     2   3      37
okra-2      1/1       2   3     118
okra-3      1/1.4     2   3      66
okra-4      1/1       2   3     103

Table 7.2: This table shows the number of interactions needed to learn the wrappers with either the interactive version of the (k, l)-contextual learning algorithm or STALKER with Aggressive Co-Testing. The P/N column indicates the number of positive and negative examples needed to reach a 100% F1 score, always starting from a single positive example. The results were averaged over 30 random runs. The resulting k and l parameters were the same in each run and are given in the columns k and l. The P column shows the number of positive examples requested by the Co-Testing algorithm, with a minimum of two for the two random initial examples; these results were also averaged over 30 random runs. For s13-4, Co-Testing was stopped after 100 queries; for s29-2, no KL timings are reported and Co-Testing stopped without reaching a 100% F1 score (see text).


the total induction. Our approach is shown to be superior to STALKER with Co-Testing, an active learning approach.


Chapter 8

Hybrid Approach

Tree based approaches have the common limitation that they can only cope with node extraction. The common reply to this issue is that the tree based approach should be seen as the first step of a two level approach, in which sub node extraction is performed in a second step. In this chapter we show that this is indeed a viable approach. We investigate ways to extend a tree based approach with a string based approach as a second step. We implement one possibility and compare this hybrid approach experimentally with a state of the art string based approach.

As the string based approach we have chosen the STALKER system (Muslea et al. 2001), which we use without the hierarchical approach. There is no need for a hierarchical approach, as the sub node learning task in the second step is often easier than the top level task in the hierarchical approach. As the tree based approach we use our system from Section 6.4.2, based on (k, l)-contextual tree languages.

In this chapter we briefly survey the structural configurations in which sub node fields can occur. We use this information to choose between two different ways of building a hybrid system. We discuss the changes needed to learn sub node extraction tasks with the interactive system, and we perform an experimental comparison with the STALKER algorithm.

8.1 Occurrences of Sub Node Fields

We define the spanning node for a given occurrence of a field as the first common ancestor of the two boundary nodes. In the case that the start node and the end node are the same node, the spanning node is defined to be this node itself. These two cases are represented schematically in Figure 8.1.a and b, respectively. We will refer to them as cases a and b. As can be seen in Figure 2.6, the ‘Name’ field is an example of case a, with as spanning node the ‘a’ node, which spans over the start and end node. The ‘Type’ field in the same figure is an example of case b.



[Figure not reproduced: four schematic trees, labelled a) to d), showing the start and end boundary nodes of a field occurrence and its spanning node within an html document tree.]

Figure 8.1: Schematic representation of the different possible configurations in which an occurrence of a field can be found in a tree. The broken lines indicate an ancestor relation of one or more levels deep (the intermediate, irrelevant nodes are left out).

Note that the boundary nodes are not necessarily at the same depth in the tree, as illustrated by the ‘Phone’ field. For this occurrence the spanning node is the ‘p’ node.

In the Restaurant Guide example, all occurrences of the same field have different spanning nodes. It is possible, though, for different occurrences to share the same spanning node. This case (case c) is represented in Figure 8.1.c. It can also degenerate such that the boundary nodes and the spanning node coincide, so that multiple field values have to be extracted from a single text node. This is illustrated in Figure 8.1.d, and we refer to it as case d.
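As an illustration, the spanning node can be computed as a lowest-common-ancestor walk over parent pointers. The following Python sketch assumes a minimal node type with a parent pointer; the `Node` class and function name are hypothetical helpers, not part of our system.

```python
class Node:
    """A minimal tree node with a parent pointer (hypothetical helper)."""
    def __init__(self, label, parent=None):
        self.label = label
        self.parent = parent

def spanning_node(start, end):
    """Return the spanning node: the first common ancestor of the two
    boundary nodes, or the node itself when start and end coincide."""
    if start is end:
        return start
    # collect the ancestors of the start node (including itself)
    ancestors = set()
    n = start
    while n is not None:
        ancestors.add(id(n))
        n = n.parent
    # walk up from the end node until a shared ancestor is reached
    n = end
    while id(n) not in ancestors:
        n = n.parent
    return n
```

When start and end lie under the same ‘a’ node (case a), the walk stops at that ‘a’ node; when they coincide in one text node (case b), that node itself is returned.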

8.2 Possible Approaches

We recognize two approaches to combine tree based node extraction with token sequence based subsequence extraction. The first approach is to extract (or learn to extract) the spanning node, and then to extract (or learn to extract) the correct subsequence from the sequence obtained by flattening the subtree that starts at the spanning node. From this sequence we remove the initial (before the first text node) and trailing (after the last text node) markup tags.
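The flattening and trimming step of this first approach can be sketched as follows. The tuple-based node format and the helper names are hypothetical illustrations, not the representation used in our system.

```python
def flatten(node):
    """Depth-first flattening of a subtree into a token sequence.
    A node is either a text string, or a (tag, children) pair that
    contributes an opening and a closing tag token."""
    if isinstance(node, str):
        return [node]
    tag, children = node
    seq = ["<%s>" % tag]
    for child in children:
        seq.extend(flatten(child))
    seq.append("</%s>" % tag)
    return seq

def trimmed_sequence(spanning):
    """Flatten the subtree at the spanning node and drop the initial
    (before the first text node) and trailing (after the last text
    node) markup tags."""
    seq = flatten(spanning)
    is_text = [not t.startswith("<") for t in seq]
    first = is_text.index(True)
    last = len(seq) - 1 - is_text[::-1].index(True)
    return seq[first:last + 1]
```

For a toy ‘a’ node containing "New", an ‘i’ child with "China", and "Town", the trimmed sequence keeps the inner ‘i’ tags but drops the enclosing ‘a’ tags.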

Example 8.1 In Example 7.2, the Name, Type, City, and Phone fields are all extracted from the sequence that is extracted for the Restaurant tuple (the previous level in the hierarchy). In the first approach the spanning node ‘a’ is extracted, and sequence based extraction is then performed on the sequence defined by this spanning node:

New <i> China </i> Town ( chinese )

This sequence is smaller than in the hierarchical STALKER approach. The end rule BackTo(‘(’) suffices, as opposed to the BackTo(</a>) BackTo(‘(’) rule


given in Example 7.2. For the Type field, the sequence is even smaller: the spanning node is a text node (case b). The City field simplifies to node extraction; no sequence extraction is needed. Only the Phone field, with spanning node ‘p’, will need sequence extraction from the same sequence as in the hierarchical STALKER approach. For the other fields, extraction and rule induction are performed on smaller sequences, leading to smaller and more correct rules, and a faster induction.

The second approach performs two node extraction tasks in a first step: one for the start node, and one for the end node. In a second step, the start boundary is retrieved from the start node, and the end boundary from the end node. Although two different sequences are used in this approach, these sequences are in general smaller than in the first approach.

On a case by case basis, we see that for case a the second approach is better, because the learning algorithm performs better on smaller sequences. For case b the second approach is overkill, as there is no need to extract the same node twice. In cases c and d, the first approach cannot manage with a single level of sequence extraction: we need a limited hierarchical extraction that performs a list extraction of the multiple field values under the single spanning node. For case c, the second approach can extract the different start and end nodes separately, and does not need a hierarchical sequence extraction. However, for extracting all targets in case d, the second approach also requires hierarchical sequence extraction.

Overall, the second approach seems preferable. But looking at real world extraction tasks, it turns out that the Restaurant Guide example is atypical: it was contrived to contain two fields that fall under case a. Indeed, the overwhelming majority of extraction tasks we looked at is either case b, or a single node extraction without the need for sequence extraction. Extraction tasks in case a, c, or d occur rarely. Hence the first approach is not too bad after all.

As the goal of this chapter is to explore the viability of a hybrid scheme, it is sufficient to implement one approach and to compare it with the hierarchical STALKER system. We have chosen the first approach, as it is a more straightforward extension of our existing system: that system can already extract a single node, so we only had to add a postprocessing step.

8.3 Interactive System

We have extended the system described in Section 6.4.2. Instead of initially clicking on a single text node, the user selects a subsequence as the initial positive example. The system then enters a loop in which it interacts with the user to improve the wrapper, until the user is satisfied. In each iteration, it induces a hypothesis


based on the given positive and negative examples. The extraction results for the current document are visualized, and the user is invited to give counterexamples when the hypothesis is not perfect. For false negatives, the user simply selects a new positive example. For false positives we distinguish two cases. Either an extraction is shown at a correct position, but the extraction itself is too big or too small; the user can then select the correct extraction, providing a correction to the system. Or the extraction is at a position without a target value in the neighborhood; the user can indicate this, providing a new negative example.

Internally, the spanning nodes of the example selections (both new positive examples and corrections) are retrieved to form the set of positive node examples. The spanning nodes of the rejected extractions are collected to form the set of negative nodes. Based on these two sets, the induction algorithm for (k, l)-contextual languages learns a set of parameters and the associated marked tree language for the node extraction.

Next, for a (new) positive example, the sequence under the spanning node, together with the selected field, provides a (new) example for the STALKER induction algorithm.

Note that a new negative example requires the system to relearn only the extraction of the spanning node, because the set of examples used by STALKER is preserved, given that STALKER only uses positive examples. A correction, on the other hand, will often not affect the position of the spanning node, in which case it only provides a new example for STALKER.
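The interaction loop described above can be sketched as follows. Here `induce`, `extract`, and `get_feedback` are hypothetical stand-ins for the induction algorithm, the extraction run, and the user; they are not the interfaces of our actual implementation.

```python
def interactive_session(induce, extract, get_feedback):
    """Skeleton of the interactive loop: induce a hypothesis from the
    accumulated examples, show its extractions, and process the user's
    feedback.  Feedback is one of: ('ok',), ('positive', ex) for a
    false negative, ('negative', ex) for a false positive with no
    target nearby, or ('correction', ex) for a boundary adjustment."""
    positives, negatives, corrections = [], [], []
    while True:
        wrapper = induce(positives, negatives, corrections)
        feedback = get_feedback(extract(wrapper))
        kind = feedback[0]
        if kind == 'ok':                 # the user is satisfied
            return wrapper
        elif kind == 'positive':         # a missed target value
            positives.append(feedback[1])
        elif kind == 'negative':         # a spurious extraction
            negatives.append(feedback[1])
        else:                            # a correction of the boundaries
            corrections.append(feedback[1])
```

In the experiments, the role of `get_feedback` is played by a simulated user that compares the extractions against the annotated training set.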

8.4 Experiments

In our experimental setup we want to compare the number of interventions by the user needed to learn a correct wrapper. For hierarchical STALKER this means that the correct rules have to be learned for every level. For every level the user has to give two initial examples, plus extra corrections until no more mistakes can be found. For the hybrid approach, the user has to give a single initial example, and as many false positives, false negatives, and corrections as needed to learn a perfect wrapper. To simulate the user, we use the annotated training set to find all mistakes, and pass a random one to the learning algorithm.

In Table 8.1 we show the averaged results of 30 runs on each data set. We give the induction time for the two approaches in the columns ms. For the hybrid approach we give the final k and l values, and the number of Positive examples, Negative examples, and Corrections. For the hierarchical STALKER approach, we show the number of Positive examples, split over the different levels (starting on the left with the top level). When we compare the total number of interactions (P+N+C, and P summed over all levels), it is clear that the hybrid approach requires substantially fewer user interactions. The sequence extraction step in the hybrid approach and the extraction in the final level of the STALKER approach extract the same text


Data       Hybrid                        STALKER
set        P/N/C      k   l     ms       P              ms

s1-1       1/1/0      1   3     18       3/72.1/2.9       4442
s1-3       4/1.8/0    3   3    612       3/60.6/6.9       3651
s3-2       1/1/0      1   3     10       2.9/2.1/2.2       460
s3-3       1/0/0      1   2      3       2.5/2.1/2.3       316
s3-4       1/0/1.2    1   2      6       2.8/2.1/3         394
s3-5       1/0/4.9    1   2     48       2.8/2.1/5.4       554
s3-6       1/0/3.4    1   2      7       2.9/2/6.1       27520
s4-1       1/0/0      1   2      2       4.5/2.2/2.3   1136896
s4-2       1/1/0      2   3     33       4.7/3/2.1     1240828
s4-3       1/1/1      2   3     23       4.7/2.2/2     1420509
s4-4       1/1/1      2   3     22       4.8/2.7/2     1333724
s5-2       1/1/0      1   4     18       2.8/3.7/2.9      1136
s8-2       1/1/0      1   3     12       2.4/2/2.6         785
s8-3       1/1.2/0    2   3     39       2.3/2.1/2.9       675
s12-2      2/1.4/0    1   4     36       2.7/80.6/2.3     1394
s14-1      1.1/1/0.9  2   2     21       2/2.3/2.3           21
s14-3      1/0/0      1   2      3       2/2.2/2.4           21
s15-2      1/0/0      1   2      2       2.9/2.1/2.1          5
s19-2      1/1/0      2   2     13       2/2.1/2.1           85
s19-4      1/1/0      1   3      7       2/2/2               85
s20-3      1/0/0      1   2      3       2/2/2.1            155
s20-4      1/1.4/0    2   3    192       2/2/3              165
s20-5      1/1.8/0    2   3   1067       2/2/2              158
s20-6      1/1.4/0    2   3     41       2/2/3.1            159
s23-1      1/1/0      2   3     35       2.6/3.1/4.1        606
s23-3      1/1/0      1   3     11       2.6/2.6/3.4        602
s25-2      1/1/0      1   3      5       2.6/5.1/3.0         82
s27-1      1/1.6/1    2   6    195       2.7/2.4             44
s27-2      1/1.3/1    2   6    190       2.7/2.8             46
s27-3      1/1.4/1    2   6    235       2.7/2.8             52
s27-4      1/1.2/1    2   6    348       2.6/6.7           3430
s27-5      1/1/1      2   5    125       2.7/2.5             33
s27-6      1/1/0      2   5    101       2.9/2.3             44
s30-2      2/1/0.7    1   3     14       2/2.8/2.6            8
s30-3      2/1/0      1   3     14       2/3/2                8
s30-4      2/1/0      2   2     50       2/3.3/2.5           12
s30-5      2/1/0.2    2   2     26       2/2.7/2.1            7
s30-6      2/1/0      2   2     49       2/3.1/2.8            9
s30-7      2/1/0      2   2     28       2/2.5/2.1            6
s30-8      2/1/0      2   2     25       2/2.6/2.6            6

Table 8.1: Comparison of the number of interactions needed to learn a perfect wrapper, between our hybrid approach and the sequence based approach (STALKER).

value. When we compare the number of positive examples needed to learn this last extraction (P+C compared with the last number in P), we see that this number is again smaller for the hybrid approach. This is because the tree based approach returns a much smaller sequence to extract from, as illustrated in Example 8.1.

8.5 Related Work

Another approach to combining sequence based and tree based methods is described in (Jensen and Cohen 2001). A (set covering) meta learning algorithm runs the learning algorithms of different wrapper modules, evaluates their results, and chooses the best resulting rules to add to the final solution. Some of these modules are defined to combine other modules, to allow conjunctions or a multi level approach like ours. In contrast to our approach, the algorithm requires completely annotated documents (or at least a completely annotated part of the document).


8.6 Summary

In this chapter we validated the assertion that sub node extraction can be done in two steps: a node extraction task followed by a simple string based extraction task within the value of the node. We proposed an approach that uses node extraction based on (k, l)-contextual tree languages as a first step and performs, in a second step, string based extraction with the STALKER system. An experimental comparison with the hierarchical STALKER system, which performs sub node extraction in a single step, shows that our hybrid approach is superior in both the number of examples needed and the total induction time.


Chapter 9

Conclusions and Further Work

This concluding chapter provides a summary of the presented work and discusses possible further work.

9.1 Conclusions

We have introduced a new representation for the transition function of tree automata. We discussed and proved properties of this representation, compared it with existing representations, and found it to be more practical. We provided algorithms to construct a deterministic and a minimal version of a given tree automaton, and we elaborated more generally on the minimization of tree automata.

We have given a representation of wrappers for information extraction, using automata that accept marked documents. This is based on the assumption that an acceptor can be found that accepts only documents marked according to the extraction task at hand. We described a way to perform extraction with these wrappers in a single run over a tree, and we developed the necessary tools to handle these wrappers.

After introducing tree automata and their use for wrapper representation, we turned to the induction part of this work. We defined a subclass of the class of regular tree languages: the class of (k, l)-contextual tree languages. We proved that this subclass is learnable from positive examples only, and we described an algorithm that learns from sets of trees by collecting the building blocks of a given size in these trees. Moreover, we presented a two-step algorithm to learn a (k, l)-contextual tree language directly as a tree automaton.



A wrapper induction algorithm based on (k, l)-contextual tree languages was given. For practical use, a parameter estimating heuristic was provided. Additionally, a way to search for optimal parameters, given a set of negative examples, was proposed. This latter approach was expanded into an interactive algorithm that guides a user to present only non-redundant examples while searching incrementally for a perfect wrapper. We presented our implementation of this approach, which has a graphical user interface.

We went into some detail on alternative state of the art approaches. In an experimental comparison, our approach had favorable results. We compared both the performance on a small set of positive examples, and the number of positive and negative examples needed to learn a perfect wrapper in an interactive setting.

We presented a hybrid approach able to perform sub node extraction, consisting of a node extraction step using our (k, l)-contextual approach, followed by a string based extraction from that node with STALKER. Compared experimentally with STALKER used in a single step, this hybrid approach showed superior performance. This indicates that a tree based approach that can only perform node extraction, combined with a postprocessing step for sub node extraction, is a valid approach.

9.2 Further Work

9.2.1 Tree Automata Optimization

Due to the good timings obtained in the experiments with our interactive system (Section 7.3.2), further optimization of the tree automata operations became a low priority in this work. Below we give some ideas for further optimizations.

In Section 6.1.2, we showed a more optimal way to learn wrappers directly as automata, in which we avoided traversing the whole tree for a given example. We use this approach in our interactive system. In this system, however, several wrappers are learned from the same set of data, for different values of the parameters k and l, when searching for optimal values that accept all positive examples and reject all negative examples. The calculation of these related wrappers is currently performed separately. It might be an easy extension to merge some of these calculations, to reduce redundant work. Indeed, (k, l)-forks and (k, l + 1)-forks share many of their intermediate (k, i)-roots. Also, the generalization from a (k, l)-contextual tree acceptor to a (k − 1, l)-contextual tree acceptor should be feasible without needing the original examples. An existing (k, l)-contextual acceptor needs to be relearned when a new positive example is found. When we keep the intermediate marked fork set acceptor, we need to run the GetForks algorithm (the variation for marked forks from Section 6.1.2) only on that new example, to update the marked fork set acceptor, and from there construct the new (k, l)-contextual acceptor.


Concerning the minimization of tree automata, we remark that optimizations for string automata often depend on clever data structures to efficiently find evidence for splitting a partition. It seems likely that some of these optimization techniques from string minimization algorithms can be adapted to give a tree minimization algorithm of comparable complexity.

9.2.2 Extensions to (k, l)-Contextual Tree Languages

The concept of contextual languages carries over easily to graphs. A possible choice for the building blocks of a given graph are its subgraphs of size k (number of nodes) in which every two nodes of the subgraph are connected via an internal path, where an internal path is one in which each node along the path is also in the subgraph. This is illustrated in Example 9.1. Apart from the ability to learn graph languages, marked graphs could be used to learn local rules to select graph nodes.

The viability and strength of this approach on typical graph problems (for example, learning from molecule descriptions in bioinformatics) remains to be studied. It is also not clear whether a representation as automata is feasible.
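Under the assumption that a k-clump is a connected induced subgraph on k nodes, a brute-force enumeration could look as follows. This is only a sketch: it returns sets of node identities, whereas a real clump comparison (as in Example 9.1) would compare labelled clumps up to isomorphism.

```python
from itertools import combinations

def k_clumps(adj, k):
    """All k-clumps of a graph: node sets of size k whose induced
    subgraph is connected, i.e. every pair of nodes is linked by an
    internal path.  `adj` maps each node to the set of its neighbours.
    Brute force, exponential in general."""
    clumps = set()
    for nodes in combinations(adj, k):
        nodes = frozenset(nodes)
        # check connectivity of the induced subgraph by flood fill
        seen = {next(iter(nodes))}
        frontier = list(seen)
        while frontier:
            n = frontier.pop()
            for m in adj[n] & nodes:
                if m not in seen:
                    seen.add(m)
                    frontier.append(m)
        if seen == nodes:
            clumps.add(nodes)
    return clumps
```

A graph would then belong to the k-language of a set of examples when each of its k-clumps (compared as labelled shapes) also occurs in some example.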

Example 9.1 We present a toy problem to illustrate the extension to contextual graph languages. The nodes of the graphs are bricks in a building, and two nodes are connected when the two bricks share a side. We want to learn a language of buildings similar to a given set of buildings. Some buildings are shown in Figure 9.1. Using ‘Gates’ as an example, we can learn languages for each value of k. We see that the language with k = 1 contains all the other buildings, because all buildings use the same set of bricks (alphabet). The ‘Castle’ is not contained in the 2-language defined by ‘Gates’, since it contains a 2-clump that is no 2-clump of ‘Gates’. ‘Viaduct’ is the only building that is part of the 3-language of ‘Gates’. For k = 4, also ‘Viaduct’ is rejected, as it contains a 4-clump that is no 4-clump of ‘Gates’. We have not shown the 4-clumps of ‘Gates’, due to space restrictions, but it is easily checked that the specified 4-clump of ‘Viaduct’ does not occur in ‘Gates’.

Another interesting extension to pursue might be the use of stochastic models based on (k, l)-contextual languages. In previous work, an extension of k-testable tree languages (García and Vidal 1990) was proposed in the form of probabilistic k-testable tree languages (Rico-Juan et al. 2000). These were applied to the problem of syntactic disambiguation of natural language parses (Verdú-Mas et al. 2005). Given the ability of contextual languages to avoid overgeneralization, improvements could be made when learning tree-bank grammars (Charniak 1996). It would be interesting to investigate whether a probabilistic extension of (k, l)-contextual languages can maintain or improve on these results.


[Figure not reproduced: brick diagrams of the buildings ‘Gates’, ‘Viaduct’, ‘Holes’, and ‘Castle’, together with the sets of k-clumps of ‘Gates’ for k = 1, 2, 3.]

Figure 9.1: Sets of k-clumps of graphs (buildings).


9.2.3 Wrapper Extensions

In this section we point out some issues with wrapper extraction that were not fully addressed in this work.

We mostly ignored the existence of tag attributes in HTML code. An example of an attribute is the attribute named ‘href’ in an ‘a’ tag, as can be seen in Figure 2.1. In other examples, attributes were left out of the code. To make our wrappers aware of attributes, they need to be able to extract the values of attributes, for example extracting a URL from a ‘href’ or ‘src’ attribute. For extraction, only a simple extension is required, as several HTML parsers attach this information to the nodes. The extraction task is then split up into extracting the correct node and querying for the correct attribute name. Next to extraction, attribute values can also be used as extra information to distinguish target elements from other elements (for example, attributes like ‘id’ and ‘class’). One solution is to add the attribute-value pairs as a string to the tag of the node. A disadvantage of this approach is that more examples might be needed to learn the wrapper. Another approach that seems viable is to treat attributes in a similar way as we treat context. For context, a boolean parameter is used to indicate whether it should be used; the value of this parameter can be learned (Section 6.3.2). Boolean parameters could likewise be added for each pair of a tag and an attribute name. Such a parameter would then indicate whether, for the given tag, the attribute (name and value) should be attached to the tag name. Parameters only need to be added for pairs of tag and attribute that occur within the marked forks.
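The relabelling idea can be illustrated with a small sketch; here `use_attr` plays the role of the boolean parameters per (tag, attribute) pair, and the `@`-separated label format is a hypothetical choice, not part of our system.

```python
def node_label(tag, attrs, use_attr):
    """Relabel a node by attaching selected attribute name/value pairs
    to its tag name.  `use_attr` maps (tag, attribute name) pairs to
    booleans; only attributes whose parameter is True are attached."""
    parts = [tag]
    for name in sorted(attrs):
        if use_attr.get((tag, name), False):
            parts.append("%s=%s" % (name, attrs[name]))
    return "@".join(parts)
```

With the parameter for ('a', 'class') switched on, an ‘a’ node with a ‘class’ attribute gets a label that distinguishes it from other ‘a’ nodes, while its ‘href’ value is ignored.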

Our approach is geared towards single field extraction. Concerning tuple extraction, we believe that it should be solved with a multi level approach (just like sub node extraction), with multiple single field extractions as a first step. On top of this first step, we envision a tuple aggregation procedure. This introduces a new learning task: based on the extracted fields and a user giving some examples of which fields belong to which tuple, a tuple aggregator has to be learned. Different approaches are possible. If the fields of an example tuple share a common ancestor different from the ancestors of other fields, this common node could be learned for each tuple as a node extraction task. The (k, l)-contextual approach seems perfectly fit for this, and the extraction of the common node would be independent of the extraction of the single fields, preventing accumulation of errors. Another approach is to find a regularity in the sequence of extracted elements. For instance, in the Student List example, this sequence is [N(Stefan)], [S(Maurice)], [N(Anneleen)], [S(Hendrik)] (with N the marker for the name field, and S the marker for the supervisor field). An aggregator induction algorithm has to learn that a tuple consists of a subsequence of fields starting with an N-field and ending with an S-field. This approach also works when all tuples share the same ancestors. Inspecting the WIEN data sets manually, it seems that these two approaches alone would solve most of the aggregation tasks in these data sets. An integration in our system, where an appropriate choice of aggregation procedure is made based on negative examples, seems


straightforward. The two level approach has the flexibility that the tuple aggregation procedure can easily be exchanged, whereas an integrated approach as in STALKER has some inherent drawbacks, as the general structure of the document can only be described in terms of lists and tuples.
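For a fixed start and end marker, the marker-sequence aggregation rule discussed above could be implemented along these lines; this is a hypothetical sketch of the aggregation step, not a learned aggregator.

```python
def aggregate(fields, start_marker, end_marker):
    """Group a sequence of (marker, value) pairs into tuples, assuming
    each tuple is a subsequence that starts with `start_marker` and
    ends with `end_marker`."""
    tuples, current = [], None
    for marker, value in fields:
        if marker == start_marker:
            current = [value]           # a new tuple begins here
        elif current is not None:
            current.append(value)
            if marker == end_marker:    # the tuple is complete
                tuples.append(tuple(current))
                current = None
    return tuples
```

On the Student List sequence, with N as start marker and S as end marker, this groups the four extracted fields into the two (name, supervisor) tuples.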

Coming back to sub node extraction, it might be interesting to try another approach instead of the hybrid approach proposed in Chapter 8. We could extend the tree formalism in the (k, l)-contextual approach: every text node is replaced by the root of a subtree that contains the tokens of that text node as children. The extraction task then becomes a dual extraction task: extracting the initial token (which has become a node in the tree), and extracting the last token (also a node). Presumably, extending the single wildcard towards a hierarchical wildcard type system, as in string based methods, would be beneficial.


References

Aho, A. V., R. Sethi, and J. D. Ullman (1986). Compilers: Principles, Techniques, and Tools (World Student Series ed.). Addison-Wesley Series in Computer Science. Addison-Wesley Publishing Company. "The Dragon Book".

Ahonen, H. (1996). Generating grammars for structured documents using grammatical inference methods. Ph.D. thesis, University of Helsinki, Department of Computer Science.

Angluin, D. (1982). Inference of reversible languages. Journal of the ACM (JACM) 29(3), 741–765.

Angluin, D. (1988). Queries and concept learning. Machine Learning 2, 319–342.

Blum, N. (1996). An O(n log n) implementation of the standard method for minimizing n-state finite automata. Inf. Process. Lett. 57(2), 65–69.

Brüggemann-Klein, A., M. Murata, and D. Wood (2001). Regular tree and regular hedge languages over unranked alphabets. Technical Report HKUST-TCSC-2001-05.

Califf, M. E. and R. J. Mooney (1999). Relational learning of pattern-match rules for information extraction. In AAAI '99/IAAI '99: Proceedings of the Sixteenth National Conference on Artificial Intelligence and the Eleventh Innovative Applications of Artificial Intelligence Conference, Menlo Park, CA, USA, pp. 328–334. American Association for Artificial Intelligence.

Carme, J., R. Gilleron, A. Lemay, and J. Niehren (2007). Interactive learning of node selecting tree transducers. Machine Learning 66(1), 33–67.

Carme, J., A. Lemay, and J. Niehren (2004, October). Learning node selecting tree transducer from completely annotated examples. In International Colloquium on Grammatical Inference, Volume 3264 of Lecture Notes in Artificial Intelligence, pp. 91–102. Springer Verlag.

Carme, J., J. Niehren, and M. Tommasi (2004). Querying unranked trees with stepwise tree automata. In International Conference on Rewriting Techniques and Applications, Aachen, pp. 105–118.

Charniak, E. (1996). Tree-bank grammars. In AAAI/IAAI, Vol. 2, pp. 1031–1036.

Chidlovskii, B., J. Ragetli, and M. de Rijke (2000). Wrapper generation via grammar induction. In Proc. 11th European Conference on Machine Learning (ECML), Volume 1810, pp. 96–108. Springer, Berlin.

Cicchello, O. and S. Kremer (2003). Inducing grammars from sparse data sets: a survey of algorithms and results.

Comon, H., M. Dauchet, R. Gilleron, F. Jacquemard, D. Lugiez, S. Tison, and M. Tommasi (1999). Tree Automata Techniques and Applications. Available on: http://www.grappa.univ-lille3.fr/tata.

Cristau, J., C. Löding, and W. Thomas (2005). Deterministic automata on unranked trees. In Proceedings of the 15th International Symposium on Fundamentals of Computation Theory, FCT 2005, pp. 68–79.

Denis, F. (2001). Learning regular languages from simple positive examples. Machine Learning 44(1/2), 37–66.

Denis, F., C. D'Halluin, and R. Gilleron (1996). PAC learning with simple examples. In Symposium on Theoretical Aspects of Computer Science, pp. 231–242.

Freitag, D. (1998). Information extraction from HTML: Application of a general machine learning approach. In AAAI/IAAI, pp. 517–523.

Freitag, D. and N. Kushmerick (2000). Boosted wrapper induction. In Proceedings of the Seventeenth National Conference on Artificial Intelligence and Twelfth Innovative Applications of AI Conference, pp. 577–583. AAAI Press.

Freitag, D. and A. McCallum (1999). Information extraction with HMMs and shrinkage. In AAAI-99 Workshop on Machine Learning for Information Extraction.

García, P. (1993). Learning k-testable tree sets from positive data. Technical Report DSIC/II/46/1993, Departamento de Sistemas Informáticos y Computación, Universidad Politécnica de Valencia.

García, P. and E. Vidal (1990). Inference of k-testable languages in the strict sense and application to syntactic pattern recognition. IEEE Trans. Pattern Anal. Mach. Intell. 12(9), 920–925.

Gold, E. M. (1967). Language identification in the limit. Information and Control 10(5), 447–474.

Gottlob, G. and C. Koch (2004). Logic-based web information extraction. SIGMOD Rec. 33(2), 87–94.

Gulli, A. and A. Signorini (2005). The indexable web is more than 11.5 billion pages. In WWW 2005.

Hopcroft, J. E. (1971). An n log n algorithm for minimizing states in a finite automaton. In Z. Kohavi and A. Paz (Eds.), Theory of Machines and Computations, pp. 189–196. New York: Academic Press.

Hsu, C.-N. and M.-T. Dung (1998). Generating finite-state transducers for semi-structured data extraction from the web. Information Systems 23 (8), 521–538.

Huffman, D. A. (1964). The synthesis of sequential switching circuits. In E. F. Moore (Ed.), Sequential Machines: Selected Papers. Addison-Wesley.

Jensen, L. S. and W. W. Cohen (2001). A structured wrapper induction system for extracting information from semi-structured documents. In Proc. of the IJCAI-2001 Workshop on Adaptive Text Extraction and Mining.

Knuutila, T. (1993). Inference of k-testable tree languages. In H. Bunke (Ed.), Advances in Structural and Syntactic Pattern Recognition: Proc. of the Intl. Workshop, Singapore, pp. 109–120. World Scientific.

Kosala, R. (2003). Information Extraction by Tree Automata Inference. Ph.D. thesis, Department of Computer Science, Katholieke Universiteit Leuven.

Kosala, R., H. Blockeel, M. Bruynooghe, and J. Van den Bussche (2006). Information extraction from structured documents using k-testable tree automaton inference. Data and Knowledge Engineering 58 (2), 129–158.

Kosala, R., M. Bruynooghe, H. Blockeel, and J. Van den Bussche (2003). Information extraction from web documents based on local unranked tree automaton inference. In Intl. Joint Conference on Artificial Intelligence (IJCAI), pp. 403–408.

Kosala, R., J. Van den Bussche, M. Bruynooghe, and H. Blockeel (2002). Information extraction in structured documents using tree automata induction. In PKDD, Volume 2431 of Lecture Notes in Computer Science, pp. 299–310. Springer.

Kushmerick, N., D. S. Weld, and R. B. Doorenbos (1997). Wrapper induction for information extraction. In Intl. Joint Conference on Artificial Intelligence (IJCAI), pp. 729–737.

López, D., J. M. Sempere, and P. García (2004, August). Inference of reversible tree languages. IEEE Trans. on Systems, Man and Cybernetics, Part B: Cybernetics 34 (4), 1658–1665.

Martens, W. and J. Niehren (2006). On the minimization of XML schemas and tree automata for unranked trees. Journal of Computer and System Sciences.

McNaughton, R. (1974). Algebraic decision procedures for local testability. Math. Systems Theory 8 (1), 60–76.

Moore, E. (1956). Gedanken-experiments on sequential machines. In C. Shannon and J. McCarthy (Eds.), Automata Studies, pp. 129–153. Princeton, NJ: Princeton University Press.

Muggleton, S. (1990). Inductive Acquisition of Expert Knowledge. Addison-Wesley.

Muslea, I., S. Minton, and C. Knoblock (2001). Hierarchical wrapper induction for semistructured information sources. Journal of Autonomous Agents and Multi-Agent Systems 4, 93–114.

Muslea, I., S. Minton, and C. Knoblock (2003). Active learning with strong and weak views: A case study on wrapper induction. In Intl. Joint Conference on Artificial Intelligence (IJCAI), pp. 415–420.

Mäkinen, E. (1996). Inferring uniquely terminating regular languages from positive data. Report A-1996-9, Department of Computing Science, University of Tampere, Finland.

Neven, F. (2002). Automata theory for XML researchers. SIGMOD Rec. 31 (3), 39–46.

Oncina, J. and P. García (1992). Inferring regular languages in polynomial update time. Pattern Recognition and Image Analysis, 49–64.

Parekh, R. and V. Honavar (1997). Learning DFA from simple examples. In Algorithmic Learning Theory, 8th International Workshop, ALT '97, Sendai, Japan, October 1997, Proceedings, Volume 1316, pp. 116–131. Springer.

Parekh, R. and V. Honavar (2000). Grammar inference, automata induction, and language acquisition.

Raeymaekers, S. and M. Bruynooghe (2004). Minimization of finite unranked tree automata. Manuscript.

Rico-Juan, J. R., J. Calera-Rubio, and R. C. Carrasco (2000). Probabilistic k-testable tree languages. In A. Oliveira (Ed.), Proceedings of 5th International Colloquium, ICGI, pp. 221–228.

Schapire, R. E. and Y. Singer (1999). Improved boosting algorithms using confidence-rated predictions. Machine Learning 37 (3), 297–336.

Soderland, S. (1999). Learning information extraction rules for semi-structured and free text. Machine Learning 34 (1-3), 233–272.

Verdú-Mas, J. L., R. C. Carrasco, and J. Calera-Rubio (2005). Parsing with probabilistic strictly locally testable tree languages. IEEE Trans. Pattern Anal. Mach. Intell. 27 (7), 1040–1050.

Watson, B. W. (1994). A taxonomy of finite automata minimization algorithms. Report, Department of Mathematics and Computing Science, Eindhoven University of Technology, The Netherlands.

Publication List

Contributions at international conferences

Published in proceedings

• S. Raeymaekers, and M. Bruynooghe, A hybrid approach towards wrapper induction, Proceedings of the Workshops Prior Conceptual Knowledge in Machine Learning and Data Mining, and Web Mining 2.0 (Berendt, B. and Mladenic, D. and Semeraro, G. and Spiliopoulou, M. and Stumme, G., eds.), pp. 161-172, 2007

• S. Raeymaekers, and M. Bruynooghe, Wrapper induction: Learning (k,l)-contextual tree languages directly as unranked tree automata, Proceedings of the International Workshop on Mining and Learning with Graphs (MLG-2006) (Gaertner, T. and Garriga, G.C. and Meinl, T., eds.), pp. 197-204, 2006

• S. Raeymaekers, M. Bruynooghe, and J. Van den Bussche, Learning (k,l)-contextual tree languages for information extraction, ECML 2005, 16th European Conference on Machine Learning, Proceedings (Gama, J. and Camacho, R. and Brazdil, P. and Jorge, A. and Torgo, L., eds.), vol 3720, Lecture Notes in Computer Science, pp. 305-316, 2005

• S. Raeymaekers, and M. Bruynooghe, Parameterless information extraction using (k,l)-contextual tree languages, BNAIC-2004, Proceedings of the Sixteenth Belgium-Netherlands Conference on Artificial Intelligence (Verbrugge, R. and Taatgen, N. and Schomaker, L., eds.), pp. 211-218, 2004

• H. Blockeel, K. Driessens, N. Jacobs, R. Kosala, S. Raeymaekers, J. Ramon, J. Struyf, W. Van Laer, and S. Verbaeten, First order models for the predictive toxicology challenge, ECML/PKDD Workshop: The Predictive Toxicology Challenge 2000-2001 (Helma, C. and King, R. and Kramer, S. and Srinivasan, A., eds.), pp. 1-12, 2001

Not published or only as abstract

• S. Raeymaekers, Experimental comparison of wrapper induction systems, Former Freiburg, Leuven and Friends Workshop on Machine Learning, FLF-07, Massembre (Heer), Belgium, March 21-23, 2007

• S. Raeymaekers, Demonstration of wrapper learning for information extraction (k,l)-contextual tree languages, 18th Belgium-Netherlands Conference on Artificial Intelligence, BNAIC 2006, Namur, Belgium, October 5-6, 2006

• S. Raeymaekers, Information extraction tool using (k,l)-contextual tree automata, Freiburg, Leuven and Friends Workshop, FLF'05, Ferrieres, Belgium, March 7-9, 2005

• S. Raeymaekers, Minimization of tree automata, 5th "Freiburg, Leuven and Friends" Workshop on Machine Learning, FLF-04, Hinterzarten, Germany, March 8-10, 2004

• S. Raeymaekers, Automata inference: label generalisation, generalisation evidence, 4th "Freiburg, Leuven and Friends" Workshop on Machine Learning, FLF-03, Leuven/Dourbes, Belgium, March 19-21, 2003

• S. Raeymaekers, and H. Blockeel, Optimisation of automatic constant generation for ACE, 2nd Leuven-Freiburg Workshop on Machine Learning, LF-01, Oostkamp, Belgium, March 14-16, 2001

• S. Raeymaekers, K. De Vlaminck, G. Janssens, and T. Mahieu, Adaptation of two-level morphology for use in a real world application, 9th Computational Linguistics in the Netherlands, CLIN'98, Leuven, Belgium, 12 December 1998

• T. Mahieu, S. Raeymaekers, K. De Vlaminck, G. Janssens, and W. Joosen, Base Architectures for NLP, 9th Computational Linguistics in the Netherlands, CLIN'98, Leuven, Belgium, 12 December 1998

Technical reports

• S. Raeymaekers, M. Bruynooghe, and J. Van den Bussche, Learning (k,l)-contextual tree languages for information extraction, Department of Computer Science, K.U.Leuven, Report CW 390, Leuven, Belgium, September, 2004

Parts of books

• P. Flach, H. Blockeel, T. Gartner, M. Grobelnik, B. Kavsek, M. Kejkula, D. Krzywania, N. Lavrac, P. Ljubic, D. Mladenic, S. Moyle, S. Raeymaekers, J. Rauch, S. Rawles, R. Ribeiro, G. Sclep, J. Struyf, L. Todorovski, L. Torgo, D. Wettschereck, and S. Wu, On the road to knowledge: Mining 21 Years of UK traffic accident reports, Data Mining and Decision Support: Integration and Collaboration (Mladenic, D. and Lavrac, N. and Bohanec, M. and Moyle, S., eds.), vol. 745, The Kluwer International Series in Engineering and Computer Science, Kluwer, 2003, pp. 143-156

Biography

Stefan Raeymaekers was born on the 20th of January, 1974 in Leuven, Belgium. In 1992, he finished high school at the Immaculata-instituut in Tienen. He received a Bachelor's degree of Science in Engineering (Kandidaat Burgerlijk Ingenieur) in 1994 and a Master's degree of Science in Engineering in Computer Science (Burgerlijk Ingenieur in de Computerwetenschappen) in 1997, both from the Katholieke Universiteit Leuven in Belgium.

In September 1997, he started working at the K.U.Leuven on a research project on Natural Language Processing in cooperation with LANT N.V. This research project was partly funded by the Institute for the Promotion of Innovation by Science and Technology in Flanders (I.W.T.). From 2000 till 2008, he continued working at the K.U.Leuven, first in a position as teaching assistant, later on as research assistant. During this time he participated in several research projects centered on Machine Learning and Data Mining.

He started his Ph.D. research in 2003, under the supervision of Professors Maurice Bruynooghe and Jan Van den Bussche. This research, with the ECML best student paper award in 2005 as one of its highlights, culminated in a defense on January 30, 2008.

Nederlandstalige Samenvatting (Dutch Summary)

Information Extraction from Web Pages

by Means of Tree Automata Induction

Stefan Raeymaekers

Brief Summary

The World Wide Web is an invaluable source of information. Although this information can easily be interpreted by a human, it is unfortunately not trivial for a computer program to process it. The goal of 'information extraction from web pages' is to learn, starting from a number of examples, how specific information can be retrieved from structured text. These examples consist of a web page in which one of the occurrences of the target data is marked. An alternative term that is often used is 'wrapper induction', where a 'wrapper' is a procedure to retrieve data from a web page.

In summary, in this thesis we present a general way to represent wrappers with tree automata, and in addition we develop a particular technique to learn wrappers in this format, one that achieves better results than related state-of-the-art techniques.

To this end, we introduce and discuss an improved representation for tree automata, and we elaborate on the existence, uniqueness, and construction of minimal and deterministic automata. We propose a new approach to using tree automata as wrappers and work it out further for practical use, among other things with an efficient extraction algorithm.

We introduce a new algorithm for learning tree languages in general, not restricted to wrappers. This algorithm can learn from positive examples alone, because it learns only within a subclass of the regular tree languages that, in contrast to the full class of regular tree languages, is learnable from positive examples only. We adapt this induction algorithm to learn wrappers and extend it into a practical system; this involves choosing parameters, incremental marking of examples, and a graphical interface.

1 Introduction

Research context

It is beyond dispute that the World Wide Web (WWW) has seen an enormous rise in recent years, to the point where for many people it has become the first choice when they need to look up information. That it is nowadays the rule rather than the exception for companies to mention their web address in their correspondence is one indication of how deeply this technology has penetrated our society. Besides most companies and organizations, many private individuals also have their own website, on which all kinds of information is disclosed to the outside world.

A web page consists of structured text: it has its own layout. This can range from printing a piece of text in bold, or a division into paragraphs or a table, to special menu structures or fill-in forms. The markup is placed as annotations in the text itself, in a language called HTML (HyperText Markup Language). Note that this language also specifies links through which one can jump to other, usually related, pages. The language, however, makes no provision for semantic annotations. When a web shop publishes a page with a list of its products, no annotations can be added that indicate, for example, which pieces of text are names of products and which pieces of text represent the price of a product. This makes it hard for a program to automatically extract information from web pages, which is a requirement for processing large amounts of information in practice.

The term 'information extraction' refers to retrieving specific information from a document in which that information is not trivially indicated, where 'not trivially' means: not trivially for a program. If a news item reads: "Employees of N.V. Doppen protest after revelations by Mr. Jansen, who was recently appointed company director.", then a human can, by logical reasoning, easily answer when asked for the name of the director of the company in question. For a program, on the other hand, this is not trivial.

We distinguish information extraction from free-flowing text and information extraction from structured text (e.g., web pages). For free-flowing text, techniques from natural language processing can be used to unravel the mutual relations between parts of sentences. Information in web pages is often not given in sentences, but as loose elements between which a relation is suggested by the layout. This makes it even harder to retrieve information given a general question. Often, however, we see that similar data in the same page is displayed with the same layout; we also see this phenomenon in pages generated by the same script (for example, for a web shop). In such cases it is possible to retrieve the data by means of a set of rules that describe the unique markup around the target data. Such a set of rules is called a 'wrapper'. When we speak about information extraction from web pages, we mean a simplified setting in which the goal is not to find answers to an arbitrary question, but rather to find similar answers when example answers to such a question are given. In other words, we want to learn a wrapper from a number of examples.

A finite state automaton is an automaton designed to process a sequence of symbols. The automaton is always in exactly one state from a finite set of possible states. Upon processing a symbol, the automaton moves to a new state from that set (possibly the new state equals the previous one). Which state becomes the new one is fully determined by the previous state and the symbol being processed. These automata are interesting because of their simplicity, both in their use and in the specification of operations on them, while they are nevertheless very expressive: they can represent every possible regular language.

A tree is a hierarchical structure of nodes (each represented by a symbol), where under each node a number of distinct subtrees, called its children, are attached. A node without subtrees is called a leaf, and the top node is called the root. Annotations in HTML may not partially overlap: either they do not overlap at all, or one annotation contains the other. An HTML document therefore also forms a hierarchical structure, which means we can view HTML documents both as a sequence of symbols and as a tree. A tree automaton is a finite state automaton that processes a tree instead of a sequence of symbols.

Goal and motivation

There is a great demand for wrappers that make it possible to process the information on the web. Combined with the fact that building wrappers by hand requires a lot of work from specialized designers, this provides an excellent motivation for research on the automatic learning of wrappers from examples. An overview of existing work on this topic can be found in (Kosala 2003). In this thesis we want to develop a new technique for wrapper induction that offers better results than existing work.

Because of their favorable properties, finite state automata have been chosen in some existing systems to represent wrappers (some systems view a document as a sequence of symbols, others view documents as trees and use tree automata). Viewing web pages as trees keeps hierarchical relations local, which benefits the quality of wrapper induction; this is shown experimentally in (Kosala et al. 2003). We therefore represent the wrappers in this work by means of tree automata. A second goal of this thesis is then also to improve the representation of tree automata and to investigate how extraction with tree automata can be implemented more efficiently.

2 Tree Automata and Wrapper Representation

In this section we take a closer look at the representation of tree automata and at how we can use them for extraction.

General Definition of a Tree Automaton

A finite state automaton that processes a sequence of symbols (also called a 'string') reaches a certain final state. That final state can be computed recursively as the state reached with the last symbol, starting from the final state reached by the automaton on the string without its last symbol.

Analogously to a finite state automaton over strings, a finite state automaton over trees reaches a certain final state when a tree is processed. The final state is determined by the symbol of the root of the tree and by the final states of the children of the root, so the final states of the children must first be computed recursively.

Definition 1 (Tree Automaton) A tree automaton is defined by a tuple T = (Σi, Σo, Q, δ, φ) in which Σi is a set of input symbols, Σo a set of output symbols, Q a set of states, φ an output function Q → Σo, and δ : (Σi × Q∗) → Q a transition function from an input symbol and a sequence of states to a new state, such that for every a ∈ Σi and q ∈ Q, the set {w ∈ Q∗ | ((a,w) → q) ∈ δ} forms a regular language over the alphabet Q.

The extended transition function maps, by recursive application, a tree directly to its final state. The set of trees over an alphabet Σi is denoted T (Σi).

Definition 2 (Extended Transition Function) The extended transition function δ : T (Σi) → Q of a given tree automaton is defined by δ(f(s)) = δ(f, map(δ, s)), with f(s) ∈ T (Σi). The function map(func, seq) returns the sequence obtained by applying the function func to each element of seq.

If we use Σo = {accept, reject} as output alphabet, the output function defines each final state as an accepting or a rejecting state. In this way we can use the tree automaton as a 'tree recognizer', which indicates for each tree whether or not it belongs to the language defined by the tree automaton.

Figure 1: Processing of a tree by a tree automaton.

Example 1 Consider a tree recognizer that accepts all trees built from the symbols 'a' and 'b' that always have the symbol 'a' as their root, in which two consecutive children always differ, and in which every first child differs from its parent. An example of this language is the tree 'a(b(a) a(b a))', shown graphically at the far left of Figure 1.

This recognizer has {a, b} as input alphabet, {0, 1, 2} as set of states, and φ = {0 → reject, 1 → accept, 2 → reject} as output function. Of the transition function we show only the transitions needed to recognize the example tree: δ = {(a, ε) → 1, (b, ε) → 2, (a, 21) → 1, (b, 1) → 2, . . . }, where ε denotes the empty sequence of states.

To process the example tree, the subtrees must first be processed recursively. Figure 1 shows the successive steps of this processing by replacing each processed subtree with the state it reaches. The final state for the complete tree is 1, an accepting state, so the tree belongs to the language.

Note that trees in this language can have an arbitrarily large number of children; the number of possible transitions in the transition function is thus infinite.
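The recursive evaluation of Definition 2 on this example can be sketched in a few lines of Python. This is an illustrative sketch, not code from the thesis: the transition function is hard-coded as a finite table containing only the transitions listed above, whereas the full transition function of Example 1 is infinite.

```python
def eval_tree(tree, delta):
    """Extended transition function (Definition 2): recursively map a
    tree, written as a (symbol, children) pair, to the state it reaches."""
    symbol, children = tree
    child_states = tuple(eval_tree(c, delta) for c in children)
    return delta[(symbol, child_states)]

# Only the transitions of Example 1 needed for the example tree.
delta = {
    ('a', ()): 1,      # leaf 'a' reaches state 1
    ('b', ()): 2,      # leaf 'b' reaches state 2
    ('b', (1,)): 2,    # 'b' above one child in state 1
    ('a', (2, 1)): 1,  # 'a' above children in states 2, 1
}
phi = {1: 'accept', 2: 'reject'}  # output function

# The example tree a(b(a) a(b a)) as nested (symbol, children) pairs.
t = ('a', [('b', [('a', [])]),
           ('a', [('b', []), ('a', [])])])

# eval_tree(t, delta) reaches state 1, which phi maps to 'accept'
```

Evaluating the left subtree b(a) gives state 2 and the right subtree a(b a) gives state 1, so the root transition (a, 21) applies, exactly as in the stepwise processing of Figure 1.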

Representation of the Transition Function

In contrast to automata over strings, the transition function of tree automata cannot always be defined by a finite enumeration (see Example 1). We therefore propose a representation of the transition function that makes use of a single string automaton. This automaton has its own set of states; to distinguish them from the states of the tree automaton itself, we speak of transition states and tree states. Both the input and the output alphabet of this automaton consist of the set of tree states. The initial state of the transition function depends on the input symbol of the tree automaton. As a result, given an input symbol and a string of tree states, this automaton reaches a specific final (transition) state which, via the output function of that automaton, yields the correct tree state.

Example 2 A tree automaton for the language of Example 1 is shown graphically in Figure 2. The transition states are represented by circles containing the output (tree state) for that transition state. If this tree state is an accepting state, the transition state is drawn with a double circle. Initial states are indicated with the corresponding input symbol.

Figure 2: A tree automaton that accepts the language of Example 1.

Comparing our new representation of the transition function with existing ones, it turns out that several alternative representations (see (Cristau et al. 2005; Raeymaekers and Bruynooghe 2004; Carme et al. 2004)) can be regarded as a subset of ours, while another representation (Kosala et al. 2003; Neven 2002) is clearly less efficient.

A Minimal Equivalent Tree Automaton

Automata are equivalent if they generate the same output for every possible input. Equivalent automata do not necessarily have the same number of states.

Definition 3 (Equivalent Automata) Two tree automata T1 and T2 are equivalent: T1 ≡ T2 ⇔ ∀t ∈ T (Σi) : φ1(δ1(t)) = φ2(δ2(t)).

Analogously to the Myhill–Nerode theorem for string automata, it can be proven that for every tree automaton there exists a unique equivalent automaton with a minimal number of tree states over all automata equivalent to the original one. This can be proven independently of the representation of the transition function.

The transition function of a tree automaton with a minimal number of tree states can be represented by several equivalent string automata, and among these a minimal equivalent string automaton exists as well. This minimal string automaton has a minimal number of states compared with all possible transition functions of all the equivalent tree automata, so not only those of the minimal tree automaton.

A minimal equivalent automaton with a minimal equivalent transition function thus exists and is unique. To find it, we partition the set of tree states and the set of transition states into classes of equivalent states. The equivalence classes of these partitions are then used as the tree states and the transition states of the minimal tree automaton.
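As an illustration of this partition idea, here is a minimal sketch of Moore-style partition refinement for an ordinary string DFA. This is not the thesis algorithm, which refines tree states and transition states jointly; all names are illustrative, and `delta` maps `(state, symbol)` pairs to successor states.

```python
def minimize_partition(states, alphabet, delta, accepting):
    """Refine the accepting/non-accepting partition until every block
    contains only states whose successors fall into the same blocks;
    the final blocks are the states of the minimal automaton."""
    partition = [b for b in (set(accepting), set(states) - set(accepting)) if b]

    def block_of(q):
        # index of the block (w.r.t. the current partition) containing q
        return next(i for i, b in enumerate(partition) if q in b)

    changed = True
    while changed:
        changed = False
        refined = []
        for block in partition:
            groups = {}
            for q in block:
                # signature: which block each input symbol leads to
                sig = tuple(block_of(delta[q, a]) for a in alphabet)
                groups.setdefault(sig, set()).add(q)
            refined.extend(groups.values())
            changed |= len(groups) > 1
        partition = refined
    return partition

# Toy DFA over {0,1} accepting strings that end in 1; states B and C
# are equivalent and end up merged into one block.
delta = {('A', '0'): 'A', ('A', '1'): 'B',
         ('B', '0'): 'A', ('B', '1'): 'C',
         ('C', '0'): 'A', ('C', '1'): 'C'}
blocks = minimize_partition({'A', 'B', 'C'}, ['0', '1'], delta, {'B', 'C'})
# two equivalence classes remain: {'B', 'C'} and {'A'}
```

The thesis applies the same refinement principle, but to both state sets of the tree automaton at once.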

Representing Wrappers as Tree Automata

A marked document is a document with a mark on some of its elements. For an unmarked document, several marked versions (or markings) exist. We say that a marking is correct with respect to an extraction task when every mark of the marking is placed on a target element of the extraction task. Note that according to this definition an unmarked document (without marks) is also a correct marking. A marking is a completely correct marking when it is a correct marking in which every target element of the extraction task carries a mark.

Automata over an input alphabet containing symbols both with and without marks can be used to describe languages of marked documents. In this way we can define automata for the language of correct markings of an extraction task (a PCM recognizer), for the completely correct markings of an extraction task (a CCM recognizer), or even for the correct markings of an extraction task that contain at most one mark (an ESCM recognizer). Each of these automata is defined for a specific extraction task; we will no longer mention this explicitly.

We can use these automata for extraction. To find the target elements in a document, we could try all possible markings with a CCM recognizer; the single marking that is accepted then indicates the target elements. This, however, involves 2^n markings, with n the number of elements in the document. Another possibility is to try all possible markings with only one mark using a PCM or an ESCM recognizer, which requires only n runs of the recognizer.
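The n-run scheme can be sketched as follows. This is an illustrative Python sketch, not the thesis implementation: it works on a flat sequence of elements rather than a tree, and `accepts` is a stand-in for a PCM/ESCM-style recognizer (here a toy predicate that accepts a marking iff its mark sits on a 'price' element).

```python
def extract_with_n_runs(elements, accepts):
    """Try each of the n single-mark markings of the document and
    collect the positions whose marking the recognizer accepts."""
    hits = []
    for i in range(len(elements)):
        # marking: each element paired with a flag, one element marked
        marking = [(el, pos == i) for pos, el in enumerate(elements)]
        if accepts(marking):
            hits.append(i)
    return hits

def accepts(marking):
    # toy recognizer: every marked element must be a 'price'
    return all(el == 'price' or not marked for el, marked in marking)

elements = ['name', 'price', 'name', 'price']
# extract_with_n_runs(elements, accepts) yields the price positions [1, 3]
```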

An efficient extraction can be done in a single run with a CCM recognizer. For this, the execution mechanism is adapted in two ways. The first adaptation keeps track of which target elements are encountered during processing. The second adaptation considers every element in the document both marked and unmarked, and runs the recognizer in several parallel threads over all possible markings. The states in the different threads often coincide, which keeps the number of threads limited. This effect is reinforced because threads that do not correspond to the completely correct marking can often be eliminated early. For this reason a CCM recognizer is needed: it accepts only a single marking for a given document.

A learning algorithm typically works with incompletely marked examples. The result will therefore often be a recognizer that accepts a superset of ESCM, but is not guaranteed to accept PCM. Efficient extraction, on the other hand, again requires a CCM recognizer. There is thus a need for operations that convert one type of correct-marking recognizer into another. Such a converter can be built by simulating one type with the other and then generating a new automaton that is equivalent to this simulation.

3 Inductie van Boomtalen

In deze sectie definieren we een subklasse van de reguliere boomtalen, die hetmogelijk maakt om een boomtaal te leren uit enkel positieve voorbeelden. Webeschouwen ook hoe een boomautomaat, die een dergelijke taal accepteerd, geleerdkan worden.

(k, l)-Contextual Tree Languages

The idea behind (k, l)-contextual tree languages stems from the subclasses used in k-contextual (string) languages (Muggleton 1990) and k-testable languages (García and Vidal 1990). Intuitively, a contextual language is defined by a representative set of building blocks. The language consists of all objects that can be built from a subset of these building blocks. To learn the language, we collect building blocks from examples until we have a representative set. Note that the size of the building blocks is determined by the parameters k and l; we have thus defined multiple sets of building blocks.

We define (k, l)-forks as candidate building blocks of trees. The set of (k, l)-forks of a tree consists of all trees that can be formed by taking that tree or one of its subtrees, cutting away everything deeper than l levels, and additionally cutting away children at each level until only k consecutive children remain (or cutting nothing if there are originally k or fewer children). The set of all (k, l)-forks of a tree t is denoted F(k,l)(t).
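The fork construction described above can be sketched directly in Python. This is an illustrative reading of the definition, not the thesis implementation; trees are encoded as `(label, children)` tuples.

```python
from itertools import product

def subtrees(t):
    """Yield t and all of its subtrees."""
    yield t
    for child in t[1]:
        yield from subtrees(child)

def prunes(t, k, l):
    """All ways to prune t to depth at most l, keeping at each node
    at most k consecutive children."""
    label, children = t
    if l == 1 or not children:
        return {(label, ())}
    n = len(children)
    windows = [tuple(children)] if n <= k else \
              [tuple(children[i:i + k]) for i in range(n - k + 1)]
    out = set()
    for window in windows:
        variants = [prunes(c, k, l - 1) for c in window]
        for combo in product(*variants):
            out.add((label, combo))
    return out

def forks(t, k, l):
    """The set F(k,l)(t): all (k, l)-forks of tree t."""
    out = set()
    for s in subtrees(t):
        out |= prunes(s, k, l)
    return out

b, c = ('b', ()), ('c', ())
t = ('a', (b, c, b))
# With k=2, l=2: the two windows of 2 consecutive children, plus the leaves.
assert forks(t, 2, 2) == {('a', (b, c)), ('a', (c, b)), b, c}
```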

Definition 4 The (k, l)-contextual tree language with a representative set of trees G is defined as L(k,l)(G) = {t ∈ T(Σ) | F(k,l)(t) ⊆ G}.

To learn a (k, l)-contextual language from a number of example trees, we collect all building blocks of those examples in the representative set: G = ⋃t∈P F(k,l)(t), where P is the set of positive examples.
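Learning then amounts to collecting the forks of all positives and testing set inclusion. The sketch below includes the fork construction so that it runs standalone (again an illustrative encoding, not the thesis implementation).

```python
from itertools import product

def subtrees(t):
    yield t
    for child in t[1]:
        yield from subtrees(child)

def prunes(t, k, l):
    # All prunings of t to depth <= l with at most k consecutive children.
    label, children = t
    if l == 1 or not children:
        return {(label, ())}
    n = len(children)
    windows = [tuple(children)] if n <= k else \
              [tuple(children[i:i + k]) for i in range(n - k + 1)]
    return {(label, combo)
            for w in windows
            for combo in product(*[prunes(c, k, l - 1) for c in w])}

def forks(t, k, l):
    return set().union(*(prunes(s, k, l) for s in subtrees(t)))

def learn(examples, k, l):
    """Representative set G = union of the (k, l)-forks of all positives."""
    return set().union(*(forks(t, k, l) for t in examples))

def member(t, G, k, l):
    """t is in L(k,l)(G) iff all of its (k, l)-forks occur in G."""
    return forks(t, k, l) <= G

b, c = ('b', ()), ('c', ())
G = learn([('a', (b, c, b))], 2, 2)
assert member(('a', (b, c, b, c, b)), G, 2, 2)   # a longer repetition
assert not member(('a', (b, b)), G, 2, 2)        # fork ('a', (b, b)) unseen
```

Note how the longer tree is accepted although it was never seen: all its building blocks occur in G. This is exactly the generalization the contextual languages provide.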

Example 3 Figure 3 shows a tree t. Next to this tree, its set of (3, 3)-forks is depicted. We can take t as the only example from which to learn a (3, 3)-contextual language; for this we simply take its set of (3, 3)-forks as the representative set. The learned language is L(3,3)(F(3,3)({t})). The two trees on the right in Figure 3 belong to this language, since their (3, 3)-forks


Figure 3: A tree t with, on the left, the set of (3, 3)-forks of that tree and, on the right, examples of trees that belong to the language using that set as its representative set.

form a subset of the representative set. The (3, 3)-forks of the left tree, for example, are the last four forks of the second row.

(k, l)-Contextual Tree Automata

If we want to test efficiently whether a tree belongs to a (k, l)-contextual language, or to perform operations on several languages, it is useful to have a tree automaton that recognizes the (k, l)-contextual language. Such an automaton can be learned directly from the same examples in two steps.

In the first step we learn a recognizer that accepts every tree in the representative set of the (k, l)-contextual language (these are the (k, l)-forks of the examples). We learn this automaton incrementally, by running all forks through the automaton. If a transition needed to process a fork is missing, it is added during the learning phase, together with a new transition state and possibly a tree state (when the fork is complete). For efficiency reasons, the forks of a tree are processed in parallel, so that each example can be handled in a single pass.
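This first step can be sketched as a deterministic bottom-up automaton whose transition table grows on demand while the forks are run through it. This is a strong simplification of the thesis construction (it ignores the transition-state/tree-state distinction and the parallel processing); the state names and class layout are the author's own for illustration.

```python
from itertools import count

class ForkSetRecognizer:
    """Deterministic bottom-up recognizer that, after learning,
    accepts exactly the set of trees it was trained on."""
    def __init__(self):
        self.trans = {}          # (label, child-state tuple) -> state
        self.final = set()       # states reached at the root of a fork
        self._fresh = count()

    def _run(self, t, learn):
        label, children = t
        states = []
        for c in children:
            s = self._run(c, learn)
            if s is None:
                return None      # a missing transition below: reject
            states.append(s)
        key = (label, tuple(states))
        if key not in self.trans:
            if not learn:
                return None      # missing transition: reject
            self.trans[key] = next(self._fresh)  # add it while learning
        return self.trans[key]

    def learn(self, fork):
        self.final.add(self._run(fork, learn=True))

    def accepts(self, t):
        return self._run(t, learn=False) in self.final

rec = ForkSetRecognizer()
b, c = ('b', ()), ('c', ())
for f in [('a', (b, c)), ('a', (c, b)), b, c]:
    rec.learn(f)
assert rec.accepts(('a', (b, c)))
assert not rec.accepts(('a', (b, b)))
```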

In the second step, the fork-set recognizer is converted into an automaton that recognizes the corresponding (k, l)-contextual language. To this end we simulate the processing of all forks in parallel (this time without incremental updates) by the fork-set recognizer. The result is a tree automaton that is equivalent to this simulation.


4 Induction of Wrappers

In Section 2 we saw how a tree automaton can be used as a wrapper for extraction; in Section 3 we saw how, in general, a tree automaton can be learned from a number of positive examples. In this section we combine the two, showing how a wrapper can be learned and how this can be fitted into a practical application.

Practical Adaptations for Wrapper Induction

As proposed in Section 2, a wrapper can be a tree automaton that accepts only correctly marked HTML documents and that is learned from correctly marked examples. However, since infinitely many distinct text elements can occur in an HTML document, learning such a wrapper would require very many examples. For this reason we apply a preprocessing step to the data that replaces the text elements by a special token. Through this generalization, text elements that are needed to distinguish a target element from another candidate element may sometimes disappear. We call such text elements characterizing context; a set of these elements can be passed to the preprocessing step, and they are then not replaced.
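The preprocessing step can be sketched as follows, assuming the same `(label, children)` tree encoding, that text occurs in leaf nodes, and that `'@'` is the special token (all of these are illustrative assumptions, not the thesis data model):

```python
def preprocess(t, context, token='@'):
    """Replace every text leaf by a special token, except leaves whose
    text belongs to the characterizing context."""
    label, children = t
    if not children:                      # a leaf: assumed to hold text
        return (label if label in context else token, ())
    return (label, tuple(preprocess(c, context, token) for c in children))

# A list item holding "naam:" in bold followed by the person's name.
doc = ('li', (('b', (('naam:', ()),)), ('Jan Jansen', ())))
assert preprocess(doc, {'naam:'}) == \
       ('li', (('b', (('naam:', ()),)), ('@', ())))
```

A real implementation would have to distinguish empty tag elements (such as `br`) from text leaves; the sketch treats every leaf as text.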

Furthermore, we observe that the set of (k, l)-forks is divided into two classes, one with marked forks and one with unmarked forks, of which only the class of marked forks is actually used to decide whether an element is a target element. In practice we therefore use only the marked forks in wrappers. This has the additional advantage that fewer examples are needed, because we do not have to learn all unmarked forks.

Example 4 Consider a document with personal data. For each person a list (ul) is shown, with the person's name as the first element of the list. The name is in bold and preceded by 'naam:'. The extraction task is to collect all names in a document. If we use 'naam:' as characterizing context, an example document with one mark looks like Figure 4 (left).

With parameters k = 2 and l = 3 and characterizing context = {'naam:'}, a wrapper can be learned for this extraction task. The representative set for this wrapper (the marked (2, 3)-forks) is shown in Figure 4 (right).

A tree automaton for this wrapper can be obtained in the same way as described in Section 3. One adaptation is needed so that the marked fork-set recognizer is learned. For the conversion in the second step, all possible unmarked forks must then still be added. In practice we do this by taking the union with a tree automaton that accepts every unmarked tree.


Figure 4: A positive example after the preprocessing step (all text elements, except the characterizing context 'naam:', have been replaced by '@'), together with its marked (2, 3)-forks.

Determining the Parameters and Implementation

The optimal combination of parameters differs from one extraction task to another. When learning from the same examples, enlarging either of the parameters k and l makes the learned language more specific: only a subset of the elements accepted by the more general language is then accepted. Since all positive examples are accepted regardless of the parameter values, it is impossible to decide which parameters are optimal on the basis of positive examples alone.

We therefore learn the best parameter values given a small set of negative examples. We do this by choosing the parameters such that we obtain the most general language that still rejects all negative examples. By combining a learning method based on positive examples only with the determination of the parameters from negative examples, the advantage that only few examples are needed for learning is retained, while the subclass of tree languages still proves sufficiently expressive.
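A sketch of the parameter search: enumerate (k, l) from general to specific and return the first combination whose learned language rejects every negative example. The fork construction is included so the sketch runs standalone; the generality order (increasing k + l) and the bounds are simplifications chosen for illustration, not the thesis procedure.

```python
from itertools import product

def subtrees(t):
    yield t
    for child in t[1]:
        yield from subtrees(child)

def prunes(t, k, l):
    # All prunings of t to depth <= l with at most k consecutive children.
    label, children = t
    if l == 1 or not children:
        return {(label, ())}
    n = len(children)
    windows = [tuple(children)] if n <= k else \
              [tuple(children[i:i + k]) for i in range(n - k + 1)]
    return {(label, combo)
            for w in windows
            for combo in product(*[prunes(c, k, l - 1) for c in w])}

def forks(t, k, l):
    return set().union(*(prunes(s, k, l) for s in subtrees(t)))

def choose_params(pos, neg, max_k=4, max_l=4):
    """Most general (k, l), tried in increasing k + l order (a
    simplification), whose language rejects all negatives.
    Positives are accepted by construction for any (k, l)."""
    for total in range(2, max_k + max_l + 1):
        for k in range(1, max_k + 1):
            l = total - k
            if not 1 <= l <= max_l:
                continue
            G = set().union(*(forks(t, k, l) for t in pos))
            if all(not forks(t, k, l) <= G for t in neg):
                return k, l
    return None                 # no combination within the bounds works

b, c = ('b', ()), ('c', ())
pos = [('a', (b, c, b))]
neg = [('a', (b, b))]
assert choose_params(pos, neg) == (2, 2)
```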

An incremental, interactive version of this algorithm has been implemented in an application with a graphical user interface. A user can indicate a positive example by clicking on an unmarked text element. The application continuously shows which elements are accepted by the current wrapper. When the user notices an erroneous extraction, he can click on it to indicate a negative example, after which the wrapper is adapted (with different parameters). This way of working ensures that the user can only point out the falsely negative or falsely positive extractions, and these are the most informative examples for the learning process.


5 Experiments

We have experimentally evaluated how our learning algorithm performs compared with other state-of-the-art algorithms that likewise learn from positive examples only. We compared with two string-based systems, BWI (Freitag and Kushmerick 2000) and STALKER (Muslea et al. 2001), and with a system that also works on trees: the local unranked tree inference algorithm (Kosala et al. 2003). The results clearly show that our algorithm learns better than the others, especially from a small number of examples.

We also compared our interactive approach with an interactive extension of the STALKER algorithm: STALKER with Co-Testing (Muslea et al. 2003). Here too, our algorithm performs better in terms of the speed of the induction algorithm and the number of user interactions. With another interactive system, SQUIRREL (Carme et al. 2007), we did not run extensive experiments, but reported results on a few overlapping datasets certainly give no indication that our algorithm performs worse.

6 Conclusion

In this thesis we presented a new representation for the transition function of a tree automaton, with more favorable properties than alternative representations. We discussed the existence and uniqueness of a minimal equivalent tree automaton that also has a minimal transition function. In addition, we provided a wrapper representation that allows extraction by recognizing a language of correctly marked documents. For practical use, we also described how automata accepting different types of correctly marked documents can be converted into one another.

We described a general tree-language induction algorithm that can learn from a set of positive examples only. To this end a subclass of the regular tree languages was defined: the class of (k, l)-contextual tree languages. Here a language is defined by a set of all possible building blocks that may occur in the elements of the language. We also discussed an algorithm that directly learns, in two steps, a tree automaton accepting the learned (k, l)-contextual tree language.

We made specific adaptations to the preceding induction algorithm in order to learn wrappers for information extraction. We showed how good parameters for the (k, l)-contextual tree languages can be determined with the help of a set of negative examples, and implemented this as an incremental algorithm in an application with a graphical user interface. We compared our algorithm experimentally with other state-of-the-art approaches, with very favorable results.

We therefore conclude that the goal of this thesis, a new approach to information extraction from web pages with improved results, has indeed been achieved. In addition, we succeeded in defining a better representation for tree automata, one that allows efficient extraction.

Bibliography

Carme, J., R. Gilleron, A. Lemay, and J. Niehren (2007). Interactive learning of node selecting tree transducers. Machine Learning 66(1), 33–67.

Carme, J., J. Niehren, and M. Tommasi (2004). Querying unranked trees with stepwise tree automata. In International Conference on Rewriting Techniques and Applications, Aachen, pp. 105–118.

Cristau, J., C. Löding, and W. Thomas (2005). Deterministic automata on unranked trees. In Proceedings of the 15th International Symposium on Fundamentals of Computation Theory, FCT 2005, pp. 68–79.

Freitag, D. and N. Kushmerick (2000). Boosted wrapper induction. In Proceedings of the Seventeenth National Conference on Artificial Intelligence and Twelfth Innovative Applications of AI Conference, pp. 577–583. AAAI Press.

García, P. and E. Vidal (1990). Inference of k-testable languages in the strict sense and application to syntactic pattern recognition. IEEE Trans. Pattern Anal. Mach. Intell. 12(9), 920–925.

Kosala, R. (2003). Information Extraction by Tree Automata Inference. Ph.D. thesis, Department of Computer Science, Katholieke Universiteit Leuven.

Kosala, R., M. Bruynooghe, H. Blockeel, and J. Van den Bussche (2003). Information extraction from web documents based on local unranked tree automaton inference. In Intl. Joint Conference on Artificial Intelligence (IJCAI), pp. 403–408.

Muggleton, S. (1990). Inductive Acquisition of Expert Knowledge. Addison-Wesley.

Muslea, I., S. Minton, and C. Knoblock (2001). Hierarchical wrapper induction for semistructured information sources. Journal of Autonomous Agents and Multi-Agent Systems 4, 93–114.

Muslea, I., S. Minton, and C. Knoblock (2003). Active learning with strong and weak views: A case study on wrapper induction. In Intl. Joint Conference on Artificial Intelligence (IJCAI), pp. 415–420.

Neven, F. (2002). Automata theory for XML researchers. SIGMOD Rec. 31(3), 39–46.

Raeymaekers, S. and M. Bruynooghe (2004). Minimization of finite unranked tree automata. Manuscript.