xduce tabuchi naoshi, m1, yonelab. ([email protected])

XDuce

Tabuchi Naoshi, M1, Yonelab.

([email protected])

Presentation Outline

XDuce: Introduction
Regular Expression Types
Regular Expression Pattern Matching
Algorithms for Pattern Matching
Type Inference
Conclusion / Future Works
References
Xperl(?)

XDuce: For What?

A functional language for XML processing
On the basis of
  Regular Expression Types, and
  Pattern Matching
Statically typed
  i.e. outputs are statically checked against DTD-conformance etc.

Advantages (vs. “untyped”)

“Untyped” XML processing: programs using DOM etc.
  Little connection between program and XML schema
  Validity can be checked only at run-time, if at all

Advantages (vs. “embedding”)

“Embedding” : mapping XML schema into language’s type system.

e.g.

<!ELEMENT person (name, mail*, tel?)> (DTD)

↓

type person = Person of name * mail list * tel option (ML)

Advantages (vs. “embedding”)

Embedding does not suit intuition in some cases.

e.g.

Intuitively…
  (name, mail*, tel?) <: (name, mail*, tel*)
but not
  name * mail list * tel option <: name * mail list * tel list

Language Features (1/2)

ML-like pattern matching
e.g.
  match p with
  | person[name[n], (ms as mail*), tel[t]]
      -> (* case: p has a tel *)
  | person[name[n], (ms as mail*)]
      -> (* case: p has no tel *)
  …

Language Features (2/2)

Type inference
e.g. if
  type Person = person[name[String], mail*, tel[String]?]
and
  p :: Person
then
  match p with person[name[n], (ms as mail*)]
⇒ n :: String, ms :: mail* are inferred.

Applications

Bookmarks (Mozilla bookmark extraction)
Html2Latex
Diff (diff for XML)
All 300 – 350 lines.

Regular Expression Types

Types are defined in regular expression form with labels
Concatenation, alternation, repetition as basic constructors
Labels correspond to elements of XML (person, name, mail, etc…)

Syntax of Types

T ::= ()
    | X
    | l[T]
    | T, T      (* concat. *)
    | T|T       (* alt. *)
    | T*        (* rep. *)

where
  X : Type Variables
  l : Labels

Recursive Types

Types can be (mutually) recursive.
e.g.
  type Folder = Entry*
  type Entry = name[String], file[File]
             | name[String], folder[Folder]
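To make the recursion concrete, here is a minimal Python sketch of a membership check for Folder and Entry. It is an illustration only, not how XDuce checks types. The representation is an assumption: a value is a list of (label, content) pairs, String and File are both modelled as Python strings, and the function names `is_folder` and `is_entry` are hypothetical.

```python
def is_folder(seq):
    """Folder = Entry*: the sequence splits into consecutive entries."""
    if len(seq) % 2 != 0:
        return False  # every Entry consists of exactly two elements
    return all(is_entry(seq[i], seq[i + 1]) for i in range(0, len(seq), 2))


def is_entry(first, second):
    """Entry = name[String], file[File] | name[String], folder[Folder]."""
    label1, content1 = first
    label2, content2 = second
    if label1 != "name" or not isinstance(content1, str):
        return False
    if label2 == "file":
        # Assumption: File is modelled as a plain string.
        return isinstance(content2, str)
    if label2 == "folder":
        # Recursive use of Folder, inside the folder label.
        return isinstance(content2, list) and is_folder(content2)
    return False


docs = [
    ("name", "src"), ("folder", [("name", "main.c"), ("file", "int main() {}")]),
    ("name", "README"), ("file", "hello"),
]
```

Here `is_folder(docs)` holds, while a lone `[("name", "orphan")]` is rejected because it is not a complete Entry.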

Subtyping

Meaning of subtypes is as usual: all values t of T are also values of T’

T <: T’ ⇔ ∀t. (t ∈ T ⇒ t ∈ T’)
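The set-inclusion reading of subtyping can be tried out on flat sequences with a small Python sketch. This is a bounded enumeration, not XDuce's subtyping algorithm: it only compares values up to a fixed length, so it can refute a subtype relation but cannot prove one. The encoding (each label followed by `;`) and the names `members` and `is_subtype_bounded` are assumptions made for this illustration.

```python
import re
from itertools import product

LABELS = ["name", "mail", "tel"]


def members(regex, max_len):
    """All label sequences of length <= max_len that fully match regex.

    A sequence such as (name, mail) is encoded as the string 'name;mail;'.
    """
    pattern = re.compile(regex)
    found = set()
    for n in range(max_len + 1):
        for seq in product(LABELS, repeat=n):
            text = "".join(label + ";" for label in seq)
            if pattern.fullmatch(text):
                found.add(seq)
    return found


def is_subtype_bounded(sub, sup, max_len=5):
    """Check t in sub => t in sup for every value t up to max_len."""
    return members(sub, max_len) <= members(sup, max_len)


T1 = r"name;(mail;)*(tel;)?"   # (name, mail*, tel?)
T2 = r"name;(mail;)*(tel;)*"   # (name, mail*, tel*)
```

With these definitions `is_subtype_bounded(T1, T2)` holds, while the converse fails because (name, tel, tel) belongs to T2 only.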

Subtagging

Subtaggings are user-defined “ad-hoc” subtype relations between labels
e.g.
  small <: font
  … the <small> tag is a special case of the <font> tag (in HTML)

Complexity of Subtyping

In general, the subtype relation (T <: T’) is equivalent to inclusion of context-free grammars (CFGs): Undecidable!

Need some restrictions on syntax (next slide…)

Well-formedness of Types

Syntactic restriction on types to ensure “regularity”

Recursive use of types can only occur
  at the tail position of a type definition, or
  inside labels.

Well-formed Types: Examples

  type X = Int, Y
  type Y = String, X | ()
and
  type Z = String, lab[Z], String | ()
are well-formed, but
  type U = Int, U, String | ()
is not.

Complexity of Subtyping, again

With well-formedness, checking the subtype relation is:
  Still EXPTIME-complete, but
  acceptable in practical cases.

Pattern Matching

Pattern matches can also involve regular expression types.

e.g.
  match p with
  | person[name[n], (ms as mail*), (t as tel?)] -> …

Policies of Pattern Matching

Pattern matching has two basic policies:
  First-match (as in ML):
    only the first pattern matched is taken
  Longest-match (as usual in regexp. matching on strings):
    matching is done as much as possible

First-match: Example

(* p = person[name[…], mail, tel[…]] *)
match p with
| person[name[n], (ms as mail*), tel[t]]
    -> (* invoked *)
| person[name[n], (ms as mail*), (tl as tel?)]
    -> (* not invoked *)
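The first-match policy can be imitated on flat sequences in a few lines of Python. This is a rough analogy using string regular expressions, not XDuce itself. The encoding (each element written as its label followed by `;`), the clause names, and the function `first_match` are assumptions made for this sketch.

```python
import re

# Clauses are tried in order, mirroring the two patterns above.
CLAUSES = [
    ("has tel", re.compile(r"name;(mail;)*tel;")),
    ("tel optional", re.compile(r"name;(mail;)*(tel;)?")),
]


def first_match(value):
    """Return the name of the first clause whose pattern matches value."""
    for clause_name, pattern in CLAUSES:
        if pattern.fullmatch(value):
            return clause_name
    return None
```

Although "name;mail;tel;" is matched by both clauses, only the first one is selected; "name;mail;" falls through to the second.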

Longest-match: Example

(* p = person[name, mail, mail, tel] *)
match p with
| … (m1 as mail*), (m2 as mail*), …
    -> (* m1 = mail, mail
          m2 = () *)
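The same behaviour can be observed with greedy quantifiers in Python's `re` module. This is only an analogy on strings: Python uses leftmost-greedy backtracking, which coincides with the longest-match policy in this particular example. The encoding of each element as its label followed by `;` is an assumption of this sketch.

```python
import re

# p = person[name, mail, mail, tel], flattened to a string.
value = "name;mail;mail;tel;"

# (m1 as mail*), (m2 as mail*) written as two greedy named groups.
pattern = re.compile(r"name;(?P<m1>(?:mail;)*)(?P<m2>(?:mail;)*)tel;")

match = pattern.fullmatch(value)
m1 = match.group("m1")  # the first group takes as many mails as it can
m2 = match.group("m2")  # nothing is left for the second group
```

Here `m1` ends up holding both mail elements and `m2` is the empty sequence.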

Exhaustiveness and Redundancy

Pattern matches are checked for exhaustiveness and redundancy.
  Exhaustiveness: No “omission” of values
  Redundancy: Never-matched patterns

Exhaustiveness

A pattern match P1 -> e1 | … | Pn -> en is exhaustive (wrt. input type T)
⇔
All values t ∈ T are matched by some Pi
or
T <: P1 | … | Pn

Exhaustiveness: Example (1/2)

match p with
| person[name[n], (ms as mail*), tel[t]]
    -> ...
| person[name[n], (ms as mail*)]
    -> ...
is exhaustive (wrt. Person)

Exhaustiveness: Example (2/2)

match p with
| person[name[n], (ms as mail*), tel[t]]
    -> ...
| person[name[n], (ms as mail+)]
    -> ...
is NOT exhaustive (wrt. Person):
  person[name[...]] does not match

Redundancy

A pattern Pi is redundant in P1 -> e1 | … | Pn -> en (wrt. input type T)
⇔
All values matched by Pi are matched by P1 | ... | Pi-1

Redundancy: Example

match p with
| person[name[n], (ms as mail*), (tl as tel?)]
    -> ...
| person[name[n], (ms as mail*)]
    -> ...
The second pattern is redundant:
  anything that matches the second pattern also matches the first one.
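Both checks reduce to set inclusion, which the following Python sketch tests by enumerating the contents of person up to a bounded length. It is an approximation for illustration, not the tree-automata-based check used by XDuce. The string encoding and the names `values_of`, `exhaustive`, and `redundant` are assumptions.

```python
import re
from itertools import product

LABELS = ["name", "mail", "tel"]
PERSON = r"name;(mail;)*(tel;)?"   # contents of Person


def values_of(regex, max_len=5):
    """Label sequences up to max_len matched by regex ('name;mail;' style)."""
    pattern = re.compile(regex)
    return {
        seq
        for n in range(max_len + 1)
        for seq in product(LABELS, repeat=n)
        if pattern.fullmatch("".join(label + ";" for label in seq))
    }


def exhaustive(input_type, patterns):
    """Every value of the input type is matched by some pattern."""
    covered = set()
    for p in patterns:
        covered |= values_of(p)
    return values_of(input_type) <= covered


def redundant(input_type, patterns, i):
    """Every input value matched by patterns[i] is matched by an earlier one."""
    earlier = set()
    for p in patterns[:i]:
        earlier |= values_of(p)
    return (values_of(patterns[i]) & values_of(input_type)) <= earlier


example1 = [r"name;(mail;)*tel;", r"name;(mail;)*"]      # exhaustive
example2 = [r"name;(mail;)*tel;", r"name;(mail;)+"]      # misses (name)
example3 = [r"name;(mail;)*(tel;)?", r"name;(mail;)*"]   # second is redundant
```

On these inputs `exhaustive(PERSON, example1)` holds, `exhaustive(PERSON, example2)` fails, and `redundant(PERSON, example3, 1)` holds.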

Algorithms for Pattern Matching

Pattern matching takes the following steps:
  Translation of values into internal forms (binary trees)
  Translation of types and patterns into internal forms (binary trees and tree automata)
  Values are matched by patterns, in terms of tree automata

Internal Forms of Values

Values are represented as binary trees internally

t ::= ε          (* leaves *)
    | l(t, t)    (* labels *)

The first child is the content of the label, the second is the remainder of the sequence.

Internal Forms of Values: Example

person[name[], mail[], mail[]]
is translated into

person(name(ε, mail(ε, mail(ε, ε))), ε)
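The translation can be written as a short recursive function. The Python sketch below is illustrative only. It assumes an external value is a list of (label, children) pairs, uses `None` for the leaf ε, and represents l(t1, t2) as the triple (l, t1, t2). The name `encode` is hypothetical.

```python
EPS = None  # the leaf ε


def encode(seq):
    """Encode a sequence of (label, children) elements as a binary tree.

    The result is EPS for the empty sequence, otherwise a triple
    (label, content, rest): content encodes the children of the first
    element and rest encodes its remaining siblings.
    """
    if not seq:
        return EPS
    (label, children), siblings = seq[0], seq[1:]
    return (label, encode(children), encode(siblings))


# person[name[], mail[], mail[]]
person = [("person", [("name", []), ("mail", []), ("mail", [])])]
tree = encode(person)
```

The resulting `tree` has the same shape as the term shown above, with `None` in place of ε.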

Internal Forms of Types

Types are also translated into binary trees

T ::= φ          (* empty *)
    | ε          (* leaves *)
    | T|T
    | l(X, X)

X ranges over states, used in tree automata

Internal Forms of Types: Tree Automata

A tree automaton M is a mapping of States -> Types
e.g.
  M(X) = name(Y, Z)
  M(Y) = ε
  M(Z) = mail(Y, Z) | ε
  ...

Internal Forms of Types: Example

type Person = person[name[], mail*, tel[]?]
is translated into the binary tree
  person(X1, X0)
and tree automaton M, s.t.
  M(X0) = ε
  M(X1) = name(X0, X2)
  M(X2) = mail(X0, X2) | tel(X0, X0) | ε
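Acceptance by such an automaton can be sketched in Python as follows. This is a naive top-down check for illustration, not the algorithm used in XDuce. Assumptions: binary-tree values are `None` for ε or triples (label, content, rest), each state maps to a list of alternatives, and state X2 moves to tel directly so that a person with a tel but no mail is accepted. The names `accepts` and `is_person` are hypothetical.

```python
EPS = None  # the leaf ε

# Each alternative is "eps" or a triple (label, content_state, rest_state).
M = {
    "X0": ["eps"],
    "X1": [("name", "X0", "X2")],
    "X2": [("mail", "X0", "X2"), ("tel", "X0", "X0"), "eps"],
}


def accepts(state, tree):
    """True if the binary tree belongs to the type denoted by state."""
    for alt in M[state]:
        if alt == "eps":
            if tree is EPS:
                return True
        elif tree is not EPS:
            label, content_state, rest_state = alt
            if (tree[0] == label
                    and accepts(content_state, tree[1])
                    and accepts(rest_state, tree[2])):
                return True
    return False


def is_person(tree):
    """Check a value against the top-level type person(X1, X0)."""
    return (tree is not EPS and tree[0] == "person"
            and accepts("X1", tree[1]) and accepts("X0", tree[2]))


two_mails = ("person", ("name", None, ("mail", None, ("mail", None, None))), None)
tel_only = ("person", ("name", None, ("tel", None, None)), None)
tel_then_mail = ("person", ("name", None, ("tel", None, ("mail", None, None))), None)
```

The first two values are accepted; the third is rejected because nothing may follow tel.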

Internal Forms of Patterns

Patterns are similar to types, with some additions

P ::= (* same as types... *)
    | x : P      (* x as P *)
    | T          (* wildcard *)

Wildcards are used for variables without “as”

Internal Forms of Patterns: Example

Pattern
  person[name[n], (ms as mail*)]
is translated into binary tree
  person(Y1, Y0)
and tree automaton N, s.t.
  N(Y0) = ε
  N(Y1) = name(n:T, ms:Y2)
  N(Y2) = mail(Y0, Y2) | ε

Pattern Matching (1/3)

Pattern matching has two roles
  match input values (of course!)
  bind variables to components of the input value, if matched

Written formally
  t ∈ D ⇒ V
  “t is matched by D, yielding V” (V : Vars -> Values)


Matching relation t ∈ D ⇒ V is defined by following rules... (next slide)

Assumptions:
  D is either a pattern P or a state Y
  A tree automaton N is implied
  (D, N) corresponds to the external pattern


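\[
\frac{}{\varepsilon \in \varepsilon \Rightarrow \emptyset}
\qquad
\frac{}{t \in T \Rightarrow \emptyset}
\]

\[
\frac{t \in N(Y) \Rightarrow V}{t \in Y \Rightarrow V}
\qquad
\frac{t \in P \Rightarrow V}{t \in x : P \Rightarrow V \cup \{x \mapsto t\}}
\]

\[
\frac{t \in P_1 \Rightarrow V}{t \in P_1 \mid P_2 \Rightarrow V}
\qquad
\frac{t \notin P_1 \quad t \in P_2 \Rightarrow V}{t \in P_1 \mid P_2 \Rightarrow V}
\]

\[
\frac{t_1 \in Y_1 \Rightarrow V_1 \quad t_2 \in Y_2 \Rightarrow V_2}{l(t_1, t_2) \in l(Y_1, Y_2) \Rightarrow V_1 \cup V_2}
\]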
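The matching relation t ∈ D ⇒ V can be read as a recursive function that returns the bindings V or fails. The Python sketch below follows that reading; it is not the actual XDuce matcher. Assumptions: values are `None` (ε) or triples (label, content, rest), patterns are tagged tuples, the wildcard is the string "any", child positions of a label may hold any pattern (as in name(n:T, ms:Y2)), and the automaton N encodes person[name[n], (ms as mail*)]. All identifiers are hypothetical.

```python
EPS = None        # the leaf ε of a binary-tree value
ANY = "any"       # the wildcard pattern
EPS_PAT = "eps"   # the pattern ε

# Pattern automaton N for person[name[n], (ms as mail*)].
N = {
    "Y0": EPS_PAT,
    "Y1": ("lab", "name", ("bind", "n", ANY), ("bind", "ms", ("state", "Y2"))),
    "Y2": ("alt", ("lab", "mail", ("state", "Y0"), ("state", "Y2")), EPS_PAT),
}
TOP = ("lab", "person", ("state", "Y1"), ("state", "Y0"))


def match(t, p):
    """Return the bindings V such that t is matched by p, or None."""
    if p == ANY:
        return {}                       # wildcard: matches anything
    if p == EPS_PAT:
        return {} if t is EPS else None
    kind = p[0]
    if kind == "state":                 # look the state up in N
        return match(t, N[p[1]])
    if kind == "bind":                  # x : P adds the binding x -> t
        v = match(t, p[2])
        return None if v is None else {**v, p[1]: t}
    if kind == "alt":                   # P2 is used only if P1 fails
        v = match(t, p[1])
        return v if v is not None else match(t, p[2])
    if kind == "lab":                   # match content and remainder
        if t is EPS or t[0] != p[1]:
            return None
        v1, v2 = match(t[1], p[2]), match(t[2], p[3])
        if v1 is None or v2 is None:
            return None
        return {**v1, **v2}
    raise ValueError("unknown pattern: %r" % (p,))


mails = ("mail", None, ("mail", None, None))
value = ("person", ("name", None, mails), None)
bindings = match(value, TOP)
```

For this value, `bindings` maps n to the (empty) content of name and ms to the two-mail remainder; a person containing a tel is not matched.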

Type Inference (1/2)

Infer types of variables in patterns
Results are exact types of variables
Type of each variable depends on
  the pattern itself, and
  the type of the input

Type Inference (2/2)

Type inference is “flow-sensitive”
  In P1 -> e1 | … | Pn -> en, inference on Pi depends on P1 ... Pi-1

Because…
  Values matched by Pi are those NOT matched by P1 ... Pi-1

Type Inference: Example (1/2)

(* p :: person[name[], mail*, tel[]?] *)
match p with
| person[name[], rest] -> …

Type of rest inferred is
  mail*, tel[]?
in this case

Type Inference: Example (2/2)

match p with
| person[name[], tel[]] -> …
| person[name[], rest] -> …

Type of rest becomes
  (mail+, tel[]?) | ()
in this case, because…
  person[name[], (), tel[]]
is matched by the first pattern
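The inferred type can be sanity-checked by enumeration: take the values that may follow name in the input type, remove those captured by the first pattern, and compare with the claimed type. The Python sketch below does this up to a bounded length. It is a check of this one example, not an inference algorithm, and the encoding and the name `values_of` are assumptions.

```python
import re
from itertools import product

LABELS = ["mail", "tel"]


def values_of(regex, max_len=4):
    """Label sequences up to max_len matched by regex ('mail;tel;' style)."""
    pattern = re.compile(regex)
    return {
        seq
        for n in range(max_len + 1)
        for seq in product(LABELS, repeat=n)
        if pattern.fullmatch("".join(label + ";" for label in seq))
    }


after_name = values_of(r"(mail;)*(tel;)?")     # mail*, tel[]?
taken_by_first = values_of(r"tel;")            # first pattern: tel[] only
rest_values = after_name - taken_by_first      # what can reach `rest`
claimed = values_of(r"((mail;)+(tel;)?)?")     # (mail+, tel[]?) | ()
```

Up to the chosen bound, `rest_values` and `claimed` contain exactly the same sequences.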

Type Inference: Limitations

“Exact” type inference is possible only on
  Variables at tail position, or
  Inside labels (c.f. well-formedness)

Limitation comes from internal representation of patterns (binary trees)

Conclusion

The expressiveness of regular expression types and pattern matching is useful for XML processing.

Type inference (including subtype relation) is possible and efficient (in most practical cases).

Future Works

Precise type inference on all variables
Introducing Any type: not possible in a naïve way
  Breaks the closure property of tree automata
  Makes type inference impossible

References

Regular Expression Pattern Matching for XML: Hosoya and Pierce

Regular Expression Types for XML: Hosoya, Vouillon, and Pierce

Available @ http://xduce.sourceforge.net/papers.html

Xperl(?)

My own current research
Regular expression types for Perl
Motivation: Scripting languages
  are used more widely
  will live longer than XML

Features (in mind)

Regular expression (but not tree) types
Infer outputs of scripts, etc.
Detect possible run-time errors

Progress Report (1/3)

Parsing: Nightmare!
  ASTs can be extracted through the debug interface, fortunately :-p


Semantics: No specification but implementation
  Trying from scratch, step by step
  Queer, esp. around side-effects and data structures
  First attempt in the world?


Type System: Working along with semantics
  Types are regular expressions:
    τ ::= ε | α | ττ | τ|τ | τ* …
  Preliminary implementation of inference
  Still VERY trivial...

Resources

No documentation yet.
A working note is available, AS-IS, @
http://tabee.com/private/lab/xperl/defn.dvi
