XDuce
Tabuchi Naoshi, M1, Yonelab.
Presentation Outline
XDuce: Introduction
Regular Expression Types
Regular Expression Pattern Matching
Algorithms for Pattern Matching
Type Inference
Conclusion / Future Works
References
Xperl(?)
XDuce: For What?
A functional language for XML processing
On the basis of
  Regular Expression Types, and
  Pattern Matching
Statically typed
  i.e. outputs are statically checked for DTD conformance, etc.
Advantages (vs. “untyped”)
“Untyped” XML processing: programs using DOM etc.
  Little connection between the program and the XML schema
  Validity can be checked only at run-time, if at all
Advantages (vs. “embedding”)
“Embedding”: mapping an XML schema into the language’s type system.
e.g.
<!ELEMENT person (name, mail*, tel?)> (DTD)
↓
type person = Person of name * mail list * tel option (ML)
Advantages (vs. “embedding”)
Embedding does not suit intuition in some cases.
e.g.
Intuitively…
  (name, mail*, tel?) <: (name, mail*, tel*)
but not
  name * mail list * tel option <: name * mail list * tel list
Language Features (1/2)
ML-like pattern matching
e.g.
match p with
| person[name[n], (ms as mail*), tel[t]]
    -> (* case: p has a tel *)
| person[name[n], (ms as mail*)]
    -> (* case: p has no tel *)
…
Language Features (2/2)
Type inference
e.g. if
  type Person = person[name[String], mail*, tel[String]?]
and
  p :: Person
then
  match p with person[name[n], (ms as mail*)]
  ⇒ n :: String, ms :: mail* are inferred.
Applications
Bookmarks (Mozilla bookmark extraction)
Html2Latex
Diff (diff for XML)
All 300–350 lines.
Regular Expression Types
Types are defined in regular expression form with labels
Concatenation, alternation, and repetition as basic constructors
Labels correspond to elements of XML (person, name, mail, etc.)
Syntax of Types
T ::= ()
    | X
    | l[T]
    | T, T    (* concat. *)
    | T|T     (* alt. *)
    | T*      (* rep. *)
where
  X : Type Variables
  l : Labels
Recursive Types
Types can be (mutually) recursive.
e.g.
type Folder = Entry*
type Entry = name[String], file[File]
           | name[String], folder[Folder]
Subtyping
The meaning of subtyping is as usual: all values t of T are also values of T’
T <: T’ ⇔ ∀t. (t ∈ T ⇒ t ∈ T’)
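As a rough illustration of subtyping as set inclusion, the Python sketch below tests (name, mail*, tel?) <: (name, mail*, tel*) from the earlier “embedding” slide by brute force over short label sequences. It is only a bounded check under simplifying assumptions: element contents are ignored, each element is written as a string such as `name;`, and the helper names (`sequences`, `encode`, `included_up_to`) are invented for this example. It is not how XDuce decides subtyping.

```python
import re
from itertools import product

LABELS = ["name", "mail", "tel"]

def sequences(max_len):
    """All label sequences of length 0..max_len."""
    for n in range(max_len + 1):
        yield from product(LABELS, repeat=n)

def encode(seq):
    """Write a sequence as a string, one 'label;' per element."""
    return "".join(label + ";" for label in seq)

def included_up_to(sub, sup, max_len=5):
    """Return the first short sequence in sub but not in sup, or None."""
    sub_re, sup_re = re.compile(sub), re.compile(sup)
    for seq in sequences(max_len):
        s = encode(seq)
        if sub_re.fullmatch(s) and not sup_re.fullmatch(s):
            return seq
    return None

T1 = r"name;(?:mail;)*(?:tel;)?"   # (name, mail*, tel?)
T2 = r"name;(?:mail;)*(?:tel;)*"   # (name, mail*, tel*)

print(included_up_to(T1, T2))  # None
print(included_up_to(T2, T1))  # ('name', 'tel', 'tel')
```

No counterexample is found in one direction, while the other direction fails on a value with two tel elements, which matches the intuition stated for the regular expression types.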
Subtagging
Subtagging is a user-defined “ad-hoc” subtype relation between labels
e.g.
  small <: font
  … the <small> tag is a special case of the <font> tag (in HTML)
Complexity of Subtyping
Subtype relation (T <: T’) is equivalent to inclusion of CFGs: Undecidable!
Need some restrictions on syntax (next slide…)
Well-formedness of Types
Syntactic restriction on types to ensure “regularity”
Recursive uses of types can occur only at the tail position of a type definition, or inside labels.
Well-formed Types: Examples
type X = Int, Y
type Y = String, X | ()
and
type Z = String, lab[Z], String | ()
are well-formed, but
type U = Int, U, String | ()
is not.
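One possible reading of the restriction can be sketched in Python: collect every type variable used outside a label, note whether the use is in tail position, and reject a definition that can reach itself through a non-tail use. The tuple encoding of types and the names `top_level_uses` and `is_well_formed` are assumptions made for this sketch, not XDuce's implementation.

```python
# Types as tuples (a hypothetical encoding, not XDuce's own):
#   ("empty",)  ("base", name)  ("var", X)  ("lab", l, T)
#   ("cat", T1, T2)  ("alt", T1, T2)  ("rep", T)

def top_level_uses(t, tail=True):
    """Yield (variable, is_tail) for every variable used outside a label."""
    kind = t[0]
    if kind == "var":
        yield (t[1], tail)
    elif kind == "cat":
        yield from top_level_uses(t[1], False)   # something follows: not tail
        yield from top_level_uses(t[2], tail)
    elif kind == "alt":
        yield from top_level_uses(t[1], tail)
        yield from top_level_uses(t[2], tail)
    elif kind == "rep":
        yield from top_level_uses(t[1], False)   # another copy may follow
    # "empty", "base" and "lab" yield nothing: uses inside labels are allowed


def is_well_formed(defs):
    """True unless some type reaches itself through a non-tail, unlabelled use."""
    edges = {name: set(top_level_uses(body)) for name, body in defs.items()}
    for start in defs:
        seen = set()
        stack = [(start, False)]          # (variable, passed a non-tail use?)
        while stack:
            var, bad = stack.pop()
            for nxt, is_tail in edges.get(var, ()):
                now_bad = bad or not is_tail
                if nxt == start and now_bad:
                    return False
                if (nxt, now_bad) not in seen:
                    seen.add((nxt, now_bad))
                    stack.append((nxt, now_bad))
    return True


INT, STRING, UNIT = ("base", "Int"), ("base", "String"), ("empty",)

xy = {"X": ("cat", INT, ("var", "Y")),
      "Y": ("alt", ("cat", STRING, ("var", "X")), UNIT)}
z = {"Z": ("alt", ("cat", STRING, ("cat", ("lab", "lab", ("var", "Z")), STRING)), UNIT)}
u = {"U": ("alt", ("cat", INT, ("cat", ("var", "U"), STRING)), UNIT)}

print(is_well_formed(xy), is_well_formed(z), is_well_formed(u))  # True True False
```

The three inputs are the X/Y, Z and U definitions from this slide, so the result agrees with the classification given there.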
Complexity of Subtyping, again
With well-formedness, checking the subtype relation is:
  Still EXPTIME-complete, but
  acceptable in practical cases.
Pattern Matching
Pattern match can also involve regular expression types.
e.g.
match p with
| person[name[n], (ms as mail*), (t as tel?)]
    -> …
Policies of Pattern Matching
Pattern matching has two basic policies:
First-match (as in ML): only the first pattern matched is taken
Longest-match (as usual in regexp. matching on strings): matching is done as much as possible
First-match: Example
(* p = person[name[…], mail, tel[…]] *)
match p with
| person[name[n], (ms as mail*), tel[t]]
    -> (* invoked *)
| person[name[n], (ms as mail*), (tl as tel?)]
    -> (* not invoked *)
Longest-match: Example
(* p = person[name, mail, mail, tel] *)
match p with
| … (m1 as mail*), (m2 as mail*), …
    -> (* m1 = mail, mail
          m2 = () *)
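The same behaviour can be observed with ordinary string regular expressions, which is the analogy the policy slide draws. The Python sketch below is only an analogy, not XDuce code: the content of p is written as one `label;` string per element (an encoding assumed for this example), and the two `mail*` variables become two greedy groups.

```python
import re

# The content of p, one "label;" per element (element contents omitted).
content = "name;mail;mail;tel;"

# (m1 as mail*), (m2 as mail*) written as two greedy groups.
pattern = re.compile(r"name;(?P<m1>(?:mail;)*)(?P<m2>(?:mail;)*)tel;")

m = pattern.fullmatch(content)
print(repr(m.group("m1")), repr(m.group("m2")))  # 'mail;mail;' ''
```

The first group takes both mail elements and the second is left empty, mirroring m1 = mail, mail and m2 = ().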
Exhaustiveness and Redundancy
Pattern matches are checked for exhaustiveness and redundancy.
Exhaustiveness: no “omission” of values
Redundancy: never-matched patterns
Exhaustiveness
A pattern match P1 -> e1 | … | Pn -> en is exhaustive (wrt. input type T)
⇔ All values t ∈ T are matched by some Pi
or, equivalently,
T <: P1 | … | Pn
Exhaustiveness: Example (1/2)
match p with
| person[name[n], (ms as mail*), tel[t]]
    -> ...
| person[name[n], (ms as mail*)]
    -> ...
is an exhaustive pattern match (wrt. Person)
Exhaustiveness: Example (2/2)
match p with
| person[name[n], (ms as mail*), tel[t]]
    -> ...
| person[name[n], (ms as mail+)]
    -> ...
is NOT exhaustive (wrt. Person):
person[name[...]] does not match
Redundancy
A pattern Pi is redundant in P1 -> e1 | … | Pn -> en (wrt. input type T)
⇔ All values matched by Pi are matched by P1 | ... | Pi-1
Redundancy: Example
match p with
| person[name[n], (ms as mail*), (tl as tel?)]
    -> ...
| person[name[n], (ms as mail*)]
    -> ...
The second pattern is redundant:
anything that matches the second pattern also matches the first one.
Algorithms for Pattern Matching
Pattern matching takes the following steps
Translation of values into internal forms (binary trees)
Translation of types and patterns into internal forms (binary trees and tree automata)
Values are matched against patterns, in terms of tree automata
Internal Forms of Values
Values are represented as binary trees internally
t ::= ε          (* leaves *)
    | l(t, t)    (* labels *)
The first node is the content of the label; the second is the remainder of the sequence.
Internal Forms of Values: Example
person[name[], mail[], mail[]]
is translated into
person(name(ε, mail(ε, mail(ε, ε))), ε)
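The translation can be sketched in a few lines of Python. The input representation (a list of `(label, children)` pairs) and the function names `encode` and `show` are assumptions made for this example; only the resulting binary-tree shape comes from the slides.

```python
EPS = None  # the leaf ε

def encode(seq):
    """Turn a sequence of (label, children) elements into a binary tree.

    The result is EPS for the empty sequence, otherwise a triple
    (label, content_tree, remainder_tree).
    """
    if not seq:
        return EPS
    (label, children), rest = seq[0], seq[1:]
    return (label, encode(children), encode(rest))

def show(t):
    """Print a binary tree in the l(t, t) notation."""
    if t is EPS:
        return "ε"
    label, content, rest = t
    return f"{label}({show(content)}, {show(rest)})"

# person[name[], mail[], mail[]]
value = [("person", [("name", []), ("mail", []), ("mail", [])])]
print(show(encode(value)))  # person(name(ε, mail(ε, mail(ε, ε))), ε)
```

The printed tree is the one shown on this slide: the first component of each node is the element's content, the second is the rest of the sequence.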
Internal Forms of Types
Types are also translated into binary trees
T ::= φ          (* empty *)
    | ε          (* leaves *)
    | T|T
    | l(X, X)
X ranges over states, used in tree automata
Internal Forms of Types: Tree Automata
A tree automaton M is a mapping of States -> Types
e.g.
M(X) = name(Y, Z)
M(Y) = ε
M(Z) = mail(Y, Z) | ε
...
Internal Forms of Types: Example
type Person = person[name[], mail*, tel[]?]
is translated into the binary tree
person(X1, X0)
and the tree automaton M, s.t.
M(X0) = ε
M(X1) = name(X0, X2)
M(X2) = mail(X0, X2) | tel(X0, X0) | ε
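A minimal Python sketch of checking a binary-tree value against such an automaton follows. The table encoding, the helper `seq` and the function names are assumptions for illustration. Note one deliberate choice: state X2 here carries a direct `tel(X0, X0)` alternative, so that a person with no mail but with a tel is accepted, as the type Person requires. Subtagging is ignored.

```python
# Trees: None is ε, (label, content, rest) is l(t1, t2).
# Automaton: state -> list of alternatives; None is ε, (l, s1, s2) is l(s1, s2).
M = {
    "X0": [None],
    "X1": [("name", "X0", "X2")],
    "X2": [("mail", "X0", "X2"), ("tel", "X0", "X0"), None],
}

def accepts(M, state, t):
    """True if tree t is accepted from the given state."""
    for alt in M[state]:
        if alt is None:
            if t is None:
                return True
        elif t is not None:
            label, s1, s2 = alt
            if t[0] == label and accepts(M, s1, t[1]) and accepts(M, s2, t[2]):
                return True
    return False

def is_person(t):
    """Check t against the top-level tree person(X1, X0)."""
    return (t is not None and t[0] == "person"
            and accepts(M, "X1", t[1]) and accepts(M, "X0", t[2]))

def seq(*labels):
    """Build the binary tree for a sequence of empty elements."""
    t = None
    for label in reversed(labels):
        t = (label, None, t)
    return t

print(is_person(("person", seq("name", "mail", "mail", "tel"), None)))  # True
print(is_person(("person", seq("name", "tel", "tel"), None)))           # False
```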
Internal Forms of Patterns
Patterns are similar to types, with some additions
P ::= (* same as types... *)
    | x : P    (* x as P *)
    | T        (* wildcard *)
Wildcards are used for non “as”-ed variables
Internal Forms of Patterns: Example
Pattern
person[name[n], (ms as mail*)]
is translated into the binary tree
person(Y1, Y0)
and the tree automaton N, s.t.
N(Y0) = ε
N(Y1) = name(n:T, ms:Y2)
N(Y2) = mail(Y0, Y2) | ε
Pattern Matching (1/3)
Pattern matching has two roles
  match input values (of course!)
  bind variables to components of the input value, if matched
Written formally
  t ∈ D ⇒ V
  “t is matched by D, yielding V” (V : Vars -> Values)
Pattern Matching (2/3)
The matching relation t ∈ D ⇒ V is defined by the following rules... (next slide)
Assumptions:
  D ranges over patterns and states
  A tree automaton N is implied
  (D, N) corresponds to the external pattern
Pattern Matching (3/3)
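\[
\frac{}{t \in \mathcal{T} \Rightarrow \emptyset}
\qquad
\frac{}{\varepsilon \in \varepsilon \Rightarrow \emptyset}
\]
\[
\frac{t \in N(Y) \Rightarrow V}{t \in Y \Rightarrow V}
\qquad
\frac{t \in P \Rightarrow V}{t \in x : P \Rightarrow V \cup \{x \mapsto t\}}
\]
\[
\frac{t \in P_1 \Rightarrow V}{t \in P_1 \mid P_2 \Rightarrow V}
\qquad
\frac{t \notin P_1 \quad t \in P_2 \Rightarrow V}{t \in P_1 \mid P_2 \Rightarrow V}
\]
\[
\frac{t_1 \in Y_1 \Rightarrow V_1 \quad t_2 \in Y_2 \Rightarrow V_2}{l(t_1, t_2) \in l(Y_1, Y_2) \Rightarrow V_1 \cup V_2}
\]
Here \mathcal{T} is the wildcard pattern (written T in the pattern grammar).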
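The relation t ∈ D ⇒ V can be read as a recursive function that returns the bindings V, or nothing when t is not matched. The Python sketch below follows that reading with a first-match treatment of alternation. The tuple encoding of patterns and the name `match` are assumptions for this example, and the sketch does not attempt anything beyond the small pattern from the previous slide.

```python
# Trees: None is ε, (label, content, rest) is l(t1, t2).
# Patterns (hypothetical tuple encoding):
#   ("eps",)  ("any",)  ("state", Y)  ("bind", x, P)
#   ("alt", P1, P2)  ("lab", l, P1, P2)

def match(N, p, t):
    """Return the bindings V if t is matched by p, otherwise None."""
    kind = p[0]
    if kind == "any":                       # wildcard matches anything
        return {}
    if kind == "eps":                       # ε matches only ε
        return {} if t is None else None
    if kind == "state":                     # look the state up in N
        return match(N, N[p[1]], t)
    if kind == "bind":                      # x : P adds the binding x -> t
        v = match(N, p[2], t)
        return None if v is None else {**v, p[1]: t}
    if kind == "alt":                       # first-match: try P1, then P2
        v = match(N, p[1], t)
        return v if v is not None else match(N, p[2], t)
    if kind == "lab":                       # l(P1, P2) against l(t1, t2)
        if t is None or t[0] != p[1]:
            return None
        v1 = match(N, p[2], t[1])
        if v1 is None:
            return None
        v2 = match(N, p[3], t[2])
        return None if v2 is None else {**v1, **v2}
    raise ValueError(f"unknown pattern: {p!r}")

# person[name[n], (ms as mail*)] as person(Y1, Y0) with automaton N
N = {
    "Y0": ("eps",),
    "Y1": ("lab", "name", ("bind", "n", ("any",)), ("bind", "ms", ("state", "Y2"))),
    "Y2": ("alt", ("lab", "mail", ("state", "Y0"), ("state", "Y2")), ("eps",)),
}
PATTERN = ("lab", "person", ("state", "Y1"), ("state", "Y0"))

two_mails = ("mail", None, ("mail", None, None))
p1 = ("person", ("name", None, two_mails), None)                             # name, mail, mail
p2 = ("person", ("name", None, ("mail", None, ("tel", None, None))), None)   # name, mail, tel

print(match(N, PATTERN, p1))  # n is bound to ε (None), ms to the two mails
print(match(N, PATTERN, p2))  # None: the trailing tel is not matched
```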
Type Inference (1/2)
Infer types of variables in patterns
Results are exact types of variables
Type of each variable depends on
  the pattern itself, and
  the type of the input
Type Inference (2/2)
Type inference is “flow-sensitive”
In P1 -> e1 | … | Pn -> en, inference on Pi depends on P1 ... Pi-1
Because…
Values matched by Pi are those NOT matched by P1 ... Pi-1
Type Inference: Example (1/2)
(* p :: person[name[], mail*, tel[]?] *)
match p with
| person[name[], rest] -> …
In this case, the type inferred for rest is
  mail*, tel[]?
Type Inference: Example (2/2)
match p with
| person[name[], tel[]] -> …
| person[name[], rest] -> …
In this case, the type of rest becomes
  (mail+, tel[]?) | ()
because…
  person[name[], (), tel[]]
is matched by the first pattern
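The claimed type of rest can be sanity-checked by brute force. The Python sketch below enumerates short sequences over mail and tel, keeps those allowed by the input type but not consumed by the first pattern, and compares them with the sequences described by (mail+, tel[]?) | (). The string encoding (one `label;` per element) and all names are assumptions for this example, and a bounded enumeration is of course not the inference algorithm itself.

```python
import re
from itertools import product

def contents(max_len):
    """All sequences over {mail, tel} up to max_len, one 'label;' per element."""
    for n in range(max_len + 1):
        for labels in product(["mail", "tel"], repeat=n):
            yield "".join(label + ";" for label in labels)

after_name = re.compile(r"(?:mail;)*(?:tel;)?")       # mail*, tel[]?  (input type)
first_pattern = re.compile(r"tel;")                    # what follows name[] in the first pattern
inferred = re.compile(r"(?:(?:mail;)+(?:tel;)?)?")     # (mail+, tel[]?) | ()

# Values that can reach the second pattern: in the input type, not taken by the first.
reaching_second = {c for c in contents(4)
                   if after_name.fullmatch(c) and not first_pattern.fullmatch(c)}
described = {c for c in contents(4) if inferred.fullmatch(c)}

print(reaching_second == described)  # True
```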
Type Inference: Limitations
“Exact” type inference is possible only on
  variables at tail position, or
  inside labels (cf. well-formedness)
The limitation comes from the internal representation of patterns (binary trees)
Conclusion
The expressiveness of regular expression types and pattern matching is useful for XML processing.
Type inference (including the subtype relation) is possible and efficient (in most practical cases).
Future Works
Precise type inference on all variables
Introducing an Any type: not possible in a naïve way
  Breaks the closure property of tree automata
  Makes type inference impossible
References
Regular Expression Pattern Matching for XML: Hosoya and Pierce
Regular Expression Types for XML: Hosoya, Vouillon, and Pierce
Available @ http://xduce.sourceforge.net/papers.html
Xperl(?)
My own current research
Regular expression types for Perl
Motivation: Scripting languages
  are used more widely
  will live longer than XML
Features (in mind)
Regular expression (but not tree) types
Infer outputs of scripts, etc.
Detect possible run-time errors
Progress Report (1/3)
Parsing: Nightmare!
ASTs can be extracted through the debug interface, fortunately :-p
Progress Report (2/3)
Semantics: No specification, only the implementation
Trying from scratch, step by step
Queer, esp. around side effects and data structures
First attempt in the world?
Progress Report (3/3)
Type System: Working along with semantics
Types are regular expressions:
  τ ::= ε | α | ττ | τ|τ | τ* …
Preliminary implementation of inference
Still VERY trivial...
Resources
No documentation yet.
A working note is placed at
http://tabee.com/private/lab/xperl/defn.dvi
AS-IS.