
Post on 02-Jan-2016


Introduction to Language Theory

Prepared by

Manuel E. Bermúdez, Ph.D.
Associate Professor
University of Florida

Programming Language Translators

Definition: An alphabet (or vocabulary) Σ is a finite set of symbols.

Example: Alphabet of Pascal:
+ - * / < … (operators)
begin end if var (keywords)
<identifier> (identifiers)
<string> (strings)
<integer> (integers)
; : , ( ) [ ] (punctuators)

Note: All identifiers are represented by one symbol, because Σ must be finite.

Definition: A sequence t = t₁t₂…tₙ of symbols from an alphabet Σ is a string.

Definition: The length of a string t = t₁t₂…tₙ (denoted |t|) is n. If n = 0, the string is ε, the empty string.

Definition: Given strings s = s₁s₂…sₙ and t = t₁t₂…tₘ, the concatenation of s and t, denoted st, is the string s₁s₂…sₙt₁t₂…tₘ.


Note: εu = u = uε, uεv = uv, for any strings u,v (including ε)
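The definitions of length, concatenation, and the identity role of ε can be tried out in a few lines of Python. This is only a sketch: modeling a string as a tuple of symbols, and the names `EPSILON`, `length`, and `concat`, are assumptions made for illustration.

```python
# A string over an alphabet is modeled as a tuple of symbols (an assumption;
# a symbol may be a multi-character token such as "<identifier>").
EPSILON = ()  # the empty string ε


def length(t):
    """|t|: the number of symbols in t."""
    return len(t)


def concat(s, t):
    """st: the symbols of s followed by the symbols of t."""
    return s + t


u = ("a", "b")
v = ("b",)

print(concat(u, v))                                    # ('a', 'b', 'b')
print(length(EPSILON))                                 # 0
print(concat(EPSILON, u) == u == concat(u, EPSILON))   # True
```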

Definition: Σ* is the set of all strings of symbols from Σ.

Note: Σ* is called the reflexive, transitive closure of Σ.

Σ* is described by the graph (Σ*, ·), where “·” denotes concatenation, and there is a designated “start” node, ε.

Example: Σ = {a, b}.

Σ* is countably infinite, so we cannot compute all of Σ*; we can compute only finite subsets of it. We can, however, compute whether a given string is in Σ*.

[Figure: part of the graph (Σ*, ·), starting at ε. Nodes shown: ε, a, b, aa, ab, ba, bb, aba, abba. Edge labels shown: a, b, ba.]
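Both halves of that observation can be sketched in Python: a generator that lists Σ* shortest-first (of which only a finite prefix can ever be taken) and a membership test that always terminates. The function names are hypothetical, and the sketch assumes each symbol is a single character.

```python
from itertools import count, islice, product


def sigma_star(alphabet):
    """Yield every string over `alphabet`, shortest first. Never terminates."""
    symbols = sorted(alphabet)
    for n in count(0):
        for combo in product(symbols, repeat=n):
            yield "".join(combo)


def in_sigma_star(string, alphabet):
    """Decide membership in Σ*: every symbol must come from the alphabet."""
    return all(symbol in alphabet for symbol in string)


# Only a finite subset of Σ* can be computed; here, the first seven strings.
print(list(islice(sigma_star({"a", "b"}), 7)))
# ['', 'a', 'b', 'aa', 'ab', 'ba', 'bb']
print(in_sigma_star("abba", {"a", "b"}))  # True
```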

Example: Σ = Pascal vocabulary. Σ* = all possible alleged Pascal programs, i.e., all possible inputs to a Pascal compiler.

We need to specify L ⊆ Σ*, the set of correct Pascal programs.

Definition: A language L over an alphabet Σ is a subset of Σ*.

Example: Σ = {a, b}.

L₁ = ∅ is a language
L₂ = {ε} is a language
L₃ = {a} is a language
L₄ = {a, ba, bbab} is a language
L₅ = {aⁿbⁿ | n ≥ 0} is a language, where aⁿ = aa…a, n times
L₆ = {a, aa, aaa, …} is a language

Note: L₅ is an infinite language, but described finitely.

THIS IS THE MAIN GOAL OF LANGUAGE SPECIFICATION:

To describe (infinite) programming languages finitely, and to provide corresponding finite inclusion-test algorithms.
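As a small illustration of a finite inclusion test for an infinite language, the following sketch decides membership in {aⁿbⁿ | n ≥ 0}. The function name `in_anbn` is hypothetical, and the approach (compare against the only candidate of the right length) is just one of several possible.

```python
def in_anbn(s):
    """Return True iff s = a^n b^n for some n >= 0."""
    n, remainder = divmod(len(s), 2)
    # A member must have even length and consist of n a's followed by n b's.
    return remainder == 0 and s == "a" * n + "b" * n


print(in_anbn(""))      # True  (n = 0)
print(in_anbn("aabb"))  # True  (n = 2)
print(in_anbn("abab"))  # False
```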

Language Constructors

Definition: The catenation (or product) of two languages L₁ and L₂, denoted L₁L₂, is the set {uv | u ∈ L₁, v ∈ L₂}.

Example: L₁ = {ε, a, bb}, L₂ = {ac, c}

L₁L₂ = {ac, c, aac, ac, bbac, bbc} = {ac, c, aac, bbac, bbc}
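For finite languages, catenation maps directly onto a set comprehension. A minimal sketch, assuming languages are Python sets of strings with "" standing for ε (the name `catenate` is hypothetical):

```python
def catenate(l1, l2):
    """L1L2 = {uv | u in L1, v in L2}. The set removes duplicates such as 'ac'."""
    return {u + v for u in l1 for v in l2}


L1 = {"", "a", "bb"}   # "" plays the role of ε
L2 = {"ac", "c"}

print(sorted(catenate(L1, L2)))  # ['aac', 'ac', 'bbac', 'bbc', 'c']
```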

Definition: Lⁿ = LL…L (n times), and L⁰ = {ε}.

Example: L = {a, bb}. L³ = {aaa, aabb, abba, abbbb, bbaa, bbabb, bbbba, bbbbbb}
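The power of a finite language can be computed by repeated catenation, starting from {ε}. A sketch under the same assumptions as before (sets of Python strings, "" for ε; `power` is a hypothetical name):

```python
def power(language, n):
    """L^n: catenate L with itself n times, starting from L^0 = {ε}."""
    result = {""}
    for _ in range(n):
        result = {u + v for u in result for v in language}
    return result


L = {"a", "bb"}
print(sorted(power(L, 3)))
print(power(L, 0))  # {''}
```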

Definition: The union of two languages L₁ and L₂ is the set L₁ ∪ L₂ = {u | u ∈ L₁} ∪ {v | v ∈ L₂}.

Definition: The Kleene star (L*) of a language L is the set L* = ∪ Lⁿ, n ≥ 0.

Example: L = {a, bb}. L* = {any string composed of a’s and bb’s}

Definition: The Transitive Closure (L+) of a language L is the set L+ = ∪ Lⁿ, n ≥ 1.

Note: In general, L* = L+ ∪ {ε}, but L+ ≠ L* − {ε}.

For example, consider L = {ε}. Then {ε} = L+ ≠ L* – {ε} = {ε} – {ε} = ø.
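Since L* is usually infinite, a program can only compute a finite slice of it. The sketch below collects the strings of L* and L+ up to a length bound and replays the L = {ε} counterexample. The bound `max_len` and the function names are assumptions introduced for illustration.

```python
def star_up_to(language, max_len):
    """The strings of L* that have at most max_len symbols."""
    result = {""}          # L^0 = {ε}
    frontier = {""}
    while frontier:
        # Extend the newest strings by one more factor from L.
        frontier = {u + v for u in frontier for v in language
                    if len(u + v) <= max_len} - result
        result |= frontier
    return result


def plus_up_to(language, max_len):
    """The strings of L+ = L L* that have at most max_len symbols."""
    return {u + v for u in language for v in star_up_to(language, max_len)
            if len(u + v) <= max_len}


print(sorted(star_up_to({"a", "bb"}, 3), key=lambda s: (len(s), s)))

L = {""}                                  # L = {ε}
print(plus_up_to(L, 3))                   # {''}   so L+ = {ε}
print(star_up_to(L, 3) - {""})            # set()  so L* - {ε} is empty
```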

Grammars

Goal: Providing a means for describing languages finitely.

Method: Provide a subgraph (Σ*, →*) of (Σ*, ·) and a start node S, such that the nodes reachable from S are exactly the strings in the language.

Example: Σ = {a, b}

L = {aⁿbⁿ | n ≥ 0}

[Figure: part of the graph over Σ = {a, b}. Nodes shown: ε, a, b, aa, ab, ba, bb, aab, aaa, bbb, bba, aaba, bbaa, bbab, aabb. Edge labels shown: a, b, bb.]

“=>” (derives) is a relation defined by a finite set of rewrite rules known as productions.

Definition: Given a vocabulary V, a production is a pair (u, v) ∈ V* × V*, denoted u → v. u is called the left-part; v is called the right-part.

Example: Pseudo-English.

V = {Sentence, NP, VP, Adj, N, V, boy, girl, the, tall, jealous, hit, bit}

Sentence → NP VP (one production)
NP → N
NP → Adj NP
N → boy
N → girl
Adj → the
Adj → tall
Adj → jealous
VP → V NP
V → hit
V → bit

Note: English is much too complicated to be described this way.

Definition: Given a finite set of productions P ⊆ V* × V*, the relation => is defined such that

∀ α, β, u, v ∈ V*, αuβ => αvβ iff u → v ∈ P.

Example (Pseudo-English productions): Sentence → NP VP; NP → N; NP → Adj NP; N → boy; N → girl; Adj → the; Adj → tall; Adj → jealous; VP → V NP; V → hit; V → bit.

Sentence => NP VP
=> Adj NP VP
=> the NP VP
=> the Adj NP VP
=> the jealous NP VP
=> the jealous N VP
=> the jealous girl VP
=> the jealous girl V NP
=> the jealous girl hit NP
=> the jealous girl hit Adj NP
=> the jealous girl hit the NP
=> the jealous girl hit the N
=> the jealous girl hit the boy
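The derives relation can be checked mechanically: a step αuβ => αvβ is valid when some production u → v is applied at some position. The sketch below replays the derivation above against the Pseudo-English productions. The ASCII arrow `->`, the tuple representation, and all function names are assumptions chosen for illustration.

```python
RULES = """
Sentence -> NP VP
NP -> N
NP -> Adj NP
N -> boy
N -> girl
Adj -> the
Adj -> tall
Adj -> jealous
VP -> V NP
V -> hit
V -> bit
"""


def parse(rules):
    """Turn 'left -> right' lines into (left, right) pairs of symbol tuples."""
    productions = []
    for line in rules.strip().splitlines():
        left, right = line.split("->")
        productions.append((tuple(left.split()), tuple(right.split())))
    return productions


def derive_one_step(form, productions):
    """All forms αvβ obtainable from form = αuβ by one production u → v."""
    results = set()
    for left, right in productions:
        for i in range(len(form) - len(left) + 1):
            if form[i:i + len(left)] == left:
                results.add(form[:i] + right + form[i + len(left):])
    return results


def is_derivation(steps, productions):
    """Check that each step follows from the previous one by a single =>."""
    forms = [tuple(step.split()) for step in steps]
    return all(b in derive_one_step(a, productions)
               for a, b in zip(forms, forms[1:]))


PRODUCTIONS = parse(RULES)
DERIVATION = [
    "Sentence", "NP VP", "Adj NP VP", "the NP VP", "the Adj NP VP",
    "the jealous NP VP", "the jealous N VP", "the jealous girl VP",
    "the jealous girl V NP", "the jealous girl hit NP",
    "the jealous girl hit Adj NP", "the jealous girl hit the NP",
    "the jealous girl hit the N", "the jealous girl hit the boy",
]
print(is_derivation(DERIVATION, PRODUCTIONS))  # True
```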

Definition: A grammar is a 4-tuple G = (Φ, Σ, P, S) where

Φ is a finite set of nonterminals,
Σ is a finite set of terminals,
V = Φ ∪ Σ is the grammar’s vocabulary,
S ∈ Φ is called the start or goal symbol, and
P ⊆ V* × V* is a finite set of productions.

Example: Grammar for {aⁿbⁿ | n ≥ 0}.

G = (Φ, Σ, P, S), where Φ = {S}, Σ = {a, b}, and P = {S → aSb, S → ε}
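One way to hold such a 4-tuple in a program is a small record type that also checks the two side conditions, S ∈ Φ and P ⊆ V* × V*. This is a sketch only: the `Grammar` class, its field names, and `is_well_formed` are hypothetical, not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grammar:
    nonterminals: frozenset   # Φ
    terminals: frozenset      # Σ
    productions: tuple        # P: pairs (left, right) of symbol tuples
    start: str                # S

    @property
    def vocabulary(self):
        """V = Φ ∪ Σ."""
        return self.nonterminals | self.terminals

    def is_well_formed(self):
        """Check S ∈ Φ and that every production uses only symbols of V."""
        v = self.vocabulary
        return self.start in self.nonterminals and all(
            set(left) <= v and set(right) <= v
            for left, right in self.productions
        )


# The grammar for {a^n b^n | n >= 0}; the empty tuple stands for ε.
G = Grammar(
    nonterminals=frozenset({"S"}),
    terminals=frozenset({"a", "b"}),
    productions=((("S",), ("a", "S", "b")), (("S",), ())),
    start="S",
)
print(G.is_well_formed())  # True
```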

Derivations:

S => aSb => aaSbb => aaaSbbb => aaaaSbbbb => …

Applying S → ε to each sentential form yields a sentence:

S => ε
aSb => ab
aaSbb => aabb
aaaSbbb => aaabbb
aaaaSbbbb => aaaabbbb

Note: Normally, grammars are given by simply listing the productions.

Grammar Conventions

TWS convention

1. Upper case letter (identifier) – nonterminal
2. Lower case letter (string) – terminal
3. Lower case Greek letter – strings in V*
4. Left part of the first production is assumed to be the start symbol, e.g.
   S → aSb
   S → ε
5. Left part omitted if same as for preceding production, e.g.
   S → aSb
     → ε
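Conventions 4 and 5 are easy to mechanize when reading a grammar from text. The following sketch parses lines written in this style, reusing the previous left part when a line starts with the arrow. The function `parse_productions` and the choice to represent ε as the empty Python string are assumptions, not part of TWS.

```python
def parse_productions(text):
    """Parse 'Left → right' lines; a line that starts with '→' reuses the
    left part of the preceding production (convention 5)."""
    productions = []
    previous_left = None
    for line in text.strip().splitlines():
        if not line.strip():
            continue
        left, arrow, right = line.partition("→")
        left = left.strip() or previous_left
        if not arrow or left is None:
            raise ValueError(f"not a production: {line!r}")
        right = right.strip()
        productions.append((left, "" if right == "ε" else right))
        previous_left = left
    return productions


productions = parse_productions("""
S → aSb
  → ε
""")
start_symbol = productions[0][0]   # convention 4: first left part
print(productions)    # [('S', 'aSb'), ('S', '')]
print(start_symbol)   # S
```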

Example: Grammar for identifiers.

Identifier → Letter
  → Identifier Letter
  → Identifier Digit

Letter → ‘a’
  → ‘A’
  → ‘b’
  → ‘B’
  …
  → ‘z’
  → ‘Z’

Digit → ‘0’
  → ‘1’
  …
  → ‘9’
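A recognizer can follow the three Identifier productions directly: peel off the last symbol and recurse on the prefix. This is a sketch that assumes Letter covers the ASCII letters and Digit the ASCII digits, as the elided lists suggest; `is_identifier` is a hypothetical name, and the recursion is for clarity rather than efficiency.

```python
import string

LETTERS = set(string.ascii_letters)   # assumed range of Letter
DIGITS = set(string.digits)           # assumed range of Digit


def is_identifier(text):
    """Identifier → Letter | Identifier Letter | Identifier Digit."""
    if not text:
        return False
    if len(text) == 1:
        return text in LETTERS                 # Identifier → Letter
    prefix, last = text[:-1], text[-1]
    # Identifier → Identifier Letter | Identifier Digit
    return (last in LETTERS or last in DIGITS) and is_identifier(prefix)


print(is_identifier("x1"))   # True
print(is_identifier("1x"))   # False
```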

Definition: The language generated by a grammar G is the set L(G) = {ω ∈ Σ* | S =>* ω}.

Definition: A sentential form generated by a grammar G is any string α such that S =>* α.

Definition: A sentence generated by a grammar G is any sentential form α such that α ∈ Σ*.

Example: every string below is a sentential form; the strings in the second row are also sentences.

S => aSb => aaSbb => aaaSbbb => aaaaSbbbb => …
ε, ab, aabb, aaabbb, aaaabbbb (each derived from the form above it by S → ε)

Lemma: L(G) = {ω | ω is a sentence}

Proof: Trivial.
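The distinction between sentential forms and sentences can be seen by exploring =>* breadth-first and then keeping only the forms made entirely of terminals. A sketch for the grammar S → aSb, S → ε follows. It assumes one character per symbol and cuts the search off at `max_len` symbols, so it finds only a finite part of L(G). All names are hypothetical.

```python
from collections import deque


def sentential_forms(productions, start, max_len):
    """Forms reachable from `start` via =>*, keeping forms of <= max_len symbols."""
    seen = {start}
    queue = deque([start])
    while queue:
        form = queue.popleft()
        for left, right in productions:
            i = form.find(left)
            while i != -1:                       # every occurrence of the left part
                new = form[:i] + right + form[i + len(left):]
                if len(new) <= max_len and new not in seen:
                    seen.add(new)
                    queue.append(new)
                i = form.find(left, i + 1)
    return seen


PRODUCTIONS = [("S", "aSb"), ("S", "")]   # "" stands for ε
TERMINALS = {"a", "b"}

forms = sentential_forms(PRODUCTIONS, "S", max_len=7)
sentences = {f for f in forms if all(ch in TERMINALS for ch in f)}
print(sorted(sentences, key=len))  # ['', 'ab', 'aabb', 'aaabbb']
```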

Example:

A → aABC
  → aBC
aB → ab
bB → bb
bC → bc
CB → BC
cC → cc

Derivations:

A => aABC => aaABCBC => …

From A: A => aBC => abC => abc
From aABC: aABC => aaBCBC => aabCBC => aabBCC => aabbCC => aabbcC => aabbcc
From aaABCBC: aaABCBC => aaaBCBCBC => aaaBBCBCC => aaaBBBCCC => aaabBBCCC (2) => aaabbbCCC => aaabbbcCC (2) => aaabbbccc

L(G) = {aⁿbⁿcⁿ | n ≥ 1}
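A breadth-first search over the derives relation gives a quick sanity check on this grammar. No production shortens a form, so bounding forms at `max_len` symbols loses no sentence of that length or less. The sketch assumes one character per symbol, with upper case for nonterminals. The function and variable names are hypothetical.

```python
from collections import deque


def reachable_forms(productions, start, max_len):
    """All forms derivable from `start` whose length never exceeds max_len."""
    seen = {start}
    queue = deque([start])
    while queue:
        form = queue.popleft()
        for left, right in productions:
            i = form.find(left)
            while i != -1:                       # try every occurrence
                new = form[:i] + right + form[i + len(left):]
                if len(new) <= max_len and new not in seen:
                    seen.add(new)
                    queue.append(new)
                i = form.find(left, i + 1)
    return seen


CSG = [
    ("A", "aABC"), ("A", "aBC"),
    ("aB", "ab"), ("bB", "bb"), ("bC", "bc"),
    ("CB", "BC"), ("cC", "cc"),
]


def sentences(max_len):
    """Reachable forms that consist of terminals only."""
    return {f for f in reachable_forms(CSG, "A", max_len)
            if all(ch in "abc" for ch in f)}


print(sorted(sentences(6), key=len))  # ['abc', 'aabbcc']
```

Dead ends such as aabcBC are reachable but are not sentences, which is why the filter on terminals matters.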

The Chomsky Hierarchy

A hierarchy of grammars, the languages they generate, and the machines that accept those languages.

Type  Language Name               Grammar Name                    Restrictions on Grammar                   Accepting Machine
0     Recursively Enumerable      Unrestricted re-writing system  None                                      Turing Machine
1     Context-Sensitive Language  Context-Sensitive Grammar       For all α → β, |α| ≤ |β|                  Linear Bounded Automaton
2     Context-Free Language       Context-Free Grammar            For all α → β, α ∈ Φ                      Push-Down Automaton (parser)
3     Regular Language            Regular Grammar                 For all α → β, α ∈ Φ, β ∈ Σ ∪ ΣΦ ∪ {ε}    Finite-State Automaton
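The grammar restrictions can be turned into a simple classifier that reports the highest type number whose restriction every production satisfies. This is a sketch with stated assumptions: one character per symbol, "" for ε, and the Type 3 restriction read as "the right part is a terminal, a terminal followed by a nonterminal, or ε". The name `chomsky_type` is hypothetical.

```python
def chomsky_type(productions, nonterminals, terminals):
    """Return 3, 2, 1, or 0 for a list of (left, right) string pairs."""

    def regular(left, right):
        if left not in nonterminals:
            return False
        if right == "":
            return True                          # β = ε
        if len(right) == 1:
            return right in terminals            # β ∈ Σ
        return (len(right) == 2 and right[0] in terminals
                and right[1] in nonterminals)    # β ∈ ΣΦ

    def context_free(left, right):
        return left in nonterminals              # α ∈ Φ

    def context_sensitive(left, right):
        return len(left) <= len(right)           # |α| ≤ |β|

    checks = ((3, regular), (2, context_free), (1, context_sensitive))
    for type_number, allowed in checks:
        if all(allowed(left, right) for left, right in productions):
            return type_number
    return 0


print(chomsky_type([("S", "aS"), ("S", "")], {"S"}, {"a"}))          # 3
print(chomsky_type([("S", "aSb"), ("S", "")], {"S"}, {"a", "b"}))    # 2
```

Note that a grammar with S → ε is reported as type 2 here even though that production fails the type 1 length test; the classifier simply applies each restriction as stated.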

Language Hierarchy

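[Figure: nested language classes, innermost first, each shown with an example language.]

3: Regular Languages, e.g. {aⁿ | n > 0}
2: Context-Free Languages, e.g. {aⁿbⁿ | n > 0}
1: Context-Sensitive Languages, e.g. {aⁿbⁿcⁿ | n > 0}
0: Recursively Enumerable Languages (English?)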

We will deal with type 2 (syntax) and type 3 (lexicon) languages.