advanced compilers syntax analysis - cnuplas.cnu.ac.kr/courses/2017f/a_compilers/ac 3 syntax... ·...


Advanced Compilers: Syntax Analysis

Fall 2017

Chungnam National Univ.

Eun-Sun Cho


Compiler Front-End Structure

Source code
→ Preprocessor: reports trivial errors; processes #include, #define, #ifdef ...
→ preprocessed source code
→ Lexical Analysis → Syntax Analysis → Semantic Analysis
→ abstract syntax tree

Each analysis phase may also report errors.

Scanner (lexical analyzer) vs. Parser

Source Program → Lexical Analyzer → token → Parser
Parser → Lexical Analyzer: "get token" request

The parser usually invokes a function (e.g. scanner()) each time it needs the next token.
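The "get token" interaction can be sketched in C as follows. This is only a minimal sketch: the slides name nothing beyond scanner(), so the token kinds, the Token struct and the pointer-to-pointer signature are assumptions made for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token kinds for a tiny expression language. */
typedef enum { TOK_NUM, TOK_PLUS, TOK_STAR, TOK_LPAREN, TOK_RPAREN,
               TOK_EOF, TOK_ERROR } TokenKind;

typedef struct {
    TokenKind kind;
    int value;          /* meaningful only for TOK_NUM */
} Token;

/* Returns the next token and advances *src past it.
   The parser calls this each time it needs one more token. */
Token scanner(const char **src) {
    assert(src != NULL && *src != NULL);
    const char *p = *src;
    Token t = { TOK_EOF, 0 };

    while (isspace((unsigned char)*p)) p++;     /* skip white space */

    if (*p == '\0') {
        t.kind = TOK_EOF;                       /* stay on the terminator */
    } else if (isdigit((unsigned char)*p)) {
        t.kind = TOK_NUM;
        while (isdigit((unsigned char)*p))
            t.value = t.value * 10 + (*p++ - '0');
    } else {
        switch (*p++) {
        case '+': t.kind = TOK_PLUS;   break;
        case '*': t.kind = TOK_STAR;   break;
        case '(': t.kind = TOK_LPAREN; break;
        case ')': t.kind = TOK_RPAREN; break;
        default:  t.kind = TOK_ERROR;  break;
        }
    }
    *src = p;
    return t;
}
```

A parser would keep one Token as its lookahead and call scanner(&input) again whenever that token has been consumed.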


Syntax Analysis Related Questions

1) How to describe the syntax?

• when creating a programming language

2) How to determine if the input token stream satisfies the syntax description?

• to check whether a given input token stream is correct according to the described grammar

(1) How to Describe The Syntax

• CFG (Context-Free Grammar)
– A widely used way to define the grammar of a language
– Simple and straightforward
– A recognizer is easy to implement and can even be generated automatically
• G = (V, T, P, S)
– V : the set of nonterminal symbols (intermediate symbols)
– T : the set of terminal symbols
– P : the set of production rules N → α, where N ∈ V and α ∈ (V ∪ T)*
– S : the start symbol
• L(G) : the language generated by the grammar G

• Common notations for grammatical symbols
– Terminal symbols (T)
  • the first lowercase letters of the alphabet, like a, b and c, and the digits 0, 1, 2, …, 9
  • operators, e.g. + -
  • delimiters, e.g. ; , ( )
  • symbols between ‘ and ’, e.g. ‘if’ ‘then’
– Nonterminal symbols (V)
  • the first uppercase letters of the alphabet, like A, B and C
  • S usually denotes the start symbol (but not necessarily)
  • a name enclosed in < and >
– Production rules (P)
  • e.g.
    S → T + T
    T → ‘0’
    T → ‘1’
    T → ‘2’
  • When several rules have the same nonterminal on the left-hand side, they can be merged into a single rule sharing that left-hand side:
    A → α1, A → α2, …, A → αk becomes A → α1 | α2 | … | αk
    e.g. T → ‘0’ | ‘1’ | ‘2’
– Start symbol (S)
  • By default, the nonterminal on the left-hand side of the first production rule is the start symbol.

Various CFG Notations

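• BNF (Backus-Naur Form)
– nonterminal symbols : enclosed in < and >
– terminal symbols : character strings
  <id> ::= <letter> | <id><letter> | <id><digit>
  <letter> ::= a | b | c | … | y | z
  <digit> ::= 0 | 1 | 2 | … | 8 | 9
• EBNF (Extended BNF)
– uses meta symbols to denote repetition or option concisely
  <id> ::= <letter> { <letter> | <digit> }   (the braced part repeats 0 to 7 times)
  <letter> ::= a | b | c | … | y | z
  <digit> ::= 0 | 1 | 2 | … | 9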

Example: an ANTLR grammar, which is a kind of EBNF

grammar MiniC;
program   : decl+ ;
decl      : var_decl
          | fun_decl ;
var_decl  : type_spec IDENT ';'
          | type_spec IDENT '[' ']' ';' ;
type_spec : VOID
          | INT ;
fun_decl  : type_spec IDENT '(' params ')' compound_stmt ;
params    : param (',' param)*
          | VOID ;
param     : type_spec IDENT
          | type_spec IDENT '[' ']' ;
...

(2) How to determine if the input token stream satisfies the syntax description

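• Grammars and languages
– grammar: the rules, starting from the start symbol S
– language: the sentences the grammar generates, e.g. “똑똑한 학생들이 모였습니다.” (“Smart students have gathered.”)
• Derivation
– generating a sentence of the language from the grammar
• Syntax analysis
– building the grammatical structure of a sentence
– how? By checking whether a derivation path exists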

Derivation

• Derivation
– goal: generating a sentence of the language from the grammar
– how-to: expanding nonterminals by applying production rules in sequence
– e.g.
  grammar: E → E + E | E * E | (E) | a
  sentence: (a+a)
  derivation: E ⇒ (E) ⇒ (E+E) ⇒ (a+E) ⇒ (a+a)
• Derivation tree
– an abstraction of a derivation path

        E
      / | \
     (  E  )
      / | \
     E  +  E
     |     |
     a     a

Order of Substitution

More than one nonterminal may appear in a sentential form (the right-hand side of ‘⇒’), so the order of substitution has to be chosen:
– Leftmost derivation: substitute the leftmost nonterminal first
– Rightmost derivation: substitute the rightmost nonterminal first

Grammar
1. E → E+E
2. E → E*E
3. E → (E)
4. E → a

Leftmost derivation (rule)        Rightmost derivation (rule)
E ⇒ E*E       (2)                 E ⇒ E*E       (2)
  ⇒ (E)*E     (3)                   ⇒ E*a       (4)
  ⇒ (E+E)*E   (1)                   ⇒ (E)*a     (3)
  ⇒ (a+E)*E   (4)                   ⇒ (E+E)*a   (1)
  ⇒ (a+a)*E   (4)                   ⇒ (E+a)*a   (4)
  ⇒ (a+a)*a   (4)                   ⇒ (a+a)*a   (4)

Q) What about derivation trees?
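The two derivations can be replayed mechanically. The sketch below is an illustration under assumptions: sentential forms are plain C strings, E is the only nonterminal, and the helper names derive_step and derive are hypothetical. It rewrites either the leftmost or the rightmost E with the rule whose number is given.

```c
#include <assert.h>
#include <string.h>

/* Right-hand sides of rules 1..4 for the nonterminal E (index 0 unused). */
static const char *RHS[] = { NULL, "E+E", "E*E", "(E)", "a" };

/* Rewrites one occurrence of E in `form` (buffer capacity `cap`) with rule
   `rule`. leftmost != 0 picks the leftmost E, otherwise the rightmost one.
   Returns 1 on success, 0 if no E is left or the result would not fit. */
int derive_step(char *form, size_t cap, int rule, int leftmost) {
    assert(form != NULL && rule >= 1 && rule <= 4);
    char *pos = leftmost ? strchr(form, 'E') : strrchr(form, 'E');
    if (pos == NULL) return 0;

    size_t rhs_len = strlen(RHS[rule]);
    size_t tail_len = strlen(pos + 1);
    if (strlen(form) - 1 + rhs_len + 1 > cap) return 0;

    /* Move the tail (including '\0'), then copy the right-hand side in. */
    memmove(pos + rhs_len, pos + 1, tail_len + 1);
    memcpy(pos, RHS[rule], rhs_len);
    return 1;
}

/* Applies a whole rule sequence, starting from the start symbol E. */
int derive(char *form, size_t cap, const int *rules, size_t n, int leftmost) {
    if (cap < 2) return 0;
    strcpy(form, "E");
    for (size_t i = 0; i < n; i++)
        if (!derive_step(form, cap, rules[i], leftmost)) return 0;
    return 1;
}
```

Feeding the rule sequence 2 3 1 4 4 4 with leftmost substitution and 2 4 3 1 4 4 with rightmost substitution should end in the same sentence, (a+a)*a, which hints at the answer to the question above: both orders describe the same derivation tree.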


Syntax Analysis

• Syntax analysis (= Parsing)
– The process of determining whether a given string can be generated by the defined grammar
  • i.e. checking whether the string can be derived from the grammar
– If the statement is correct → output its syntactic structure
– If the statement is wrong → output an error message
• Syntax analyzer (= Parser)

Source Program → Scanner → a series of tokens → Parser → Syntactic Structure → IL Generator → IL (Intermediate Code)

Data Structures for Syntactic Structures

• Parse tree
– A tree representing a syntactic structure
– Same as the derivation tree
  • created while grammar rules are applied during the derivation
  • root node: the start symbol of the grammar
  • intermediate nodes: the left-hand-side nonterminals of the applied grammar rules
  • terminal nodes: the terminal symbols that make up the given string
• List of production-rule numbers
– the series of production-rule numbers applied during the derivation

Two Different Approaches to Syntactic Analysis

• Top-down approach
– The parse tree is built by extending the root node down to the terminal nodes
• Bottom-up approach
– The parse tree is built from the terminal nodes up to the root node

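Example: (a+a)

1. E → E + E   2. E → E * E   3. E → ( E )
4. E → -E      5. E → a

1) Top-down parsing
[Figure: the parse tree for (a+a) grown step by step from the root E down to the leaves ( a + a )]

2) Bottom-up parsing
[Figure: the parse tree for (a+a) built step by step from the leaves ( a + a ) up to the root E]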

Two different approaches to syntactic analysis (more)

– Top-down approach
  • rules are applied in the same order as in the leftmost derivation
  • “left parse”: the list of production-rule numbers in the leftmost derivation
– Bottom-up approach
  • rules are applied in the reverse order of the rightmost derivation
  • Note that the bottom-up approach still matches the input string from left to right: the nonterminals on the left side of the parse tree are completed first, so in the derivation path the terminals on the right side must appear first
  • “right parse”: the reversed list of production-rule numbers in the rightmost derivation

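e.g. 1: abac

1. S → XY   2. X → aX   3. X → b   4. Y → aY   5. Y → c

Parse tree (the same for both approaches):

        S
      /   \
     X     Y
    / \   / \
   a   X a   Y
       |     |
       b     c

top-down
leftmost derivation: S ⇒ XY ⇒ aXY ⇒ abY ⇒ abaY ⇒ abac
left parse : 12345

bottom-up
rightmost derivation: S ⇒ XY ⇒ XaY ⇒ Xac ⇒ aXac ⇒ abac
right parse : 32541

Syntax Analysis I. Top-Down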

Top-down Approach

• A more intuitive approach (similar to the leftmost derivation process)
• First,
– Derive with the first production rule of the start symbol and produce a string
• Then,
– Compare, one by one, each character of the derived string with the input string
– If the next character of the derived string and that of the input string are the same, keep proceeding
– If the derived string has a nonterminal, derive a new string with the first rule of that nonterminal, and compare the characters of the derived string and the input string (recursively)
– If they differ? … backtracking!

Backtracking

• If the compared characters differ
– Assume that the previously chosen rule was wrong; roll back and apply another candidate rule
– If the derivation with the new rule also fails, choose yet another rule and repeat
– If every candidate fails and no untried rule is left, conclude that the input string is wrong

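e.g.
Is the input string accd correct?

1. S → aAd   2. S → aB
3. A → b     4. A → c
5. B → ccd   6. B → ddc

First attempt, S → aAd: a matches; A → b fails against c; A → c matches, but then d does not match the next input character c, so backtrack.
Second attempt, S → aB: a matches; B → ccd matches the remaining ccd, so accd is accepted.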
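The backtracking procedure for this grammar can be sketched in C. This is a minimal illustration, not a general parser: uppercase letters stand for nonterminals, the function names derive and accepts are hypothetical, and the approach only terminates because the grammar has no left recursion.

```c
#include <assert.h>
#include <string.h>

typedef struct { char lhs; const char *rhs; } Rule;

/* Rules 1-6 of the example; uppercase letters are nonterminals. */
static const Rule RULES[] = {
    {'S', "aAd"}, {'S', "aB"},
    {'A', "b"},   {'A', "c"},
    {'B', "ccd"}, {'B', "ddc"},
};
enum { NUM_RULES = sizeof RULES / sizeof RULES[0], MAX_FORM = 64 };

/* Tries to derive exactly `input` from the sentential form `form`.
   Returns 1 on success, 0 after every alternative has failed. */
static int derive(const char *form, const char *input) {
    if (*form == '\0')
        return *input == '\0';            /* both must be used up */

    if (*form < 'A' || *form > 'Z')       /* terminal: compare and advance */
        return *form == *input && derive(form + 1, input + 1);

    /* Nonterminal: try its rules in order. A failed attempt simply
       returns here, and the loop moves on: that is the backtracking. */
    for (int i = 0; i < NUM_RULES; i++) {
        if (RULES[i].lhs != *form) continue;
        char next[MAX_FORM];
        size_t need = strlen(RULES[i].rhs) + strlen(form + 1) + 1;
        if (need > sizeof next) return 0; /* form too long for this sketch */
        strcpy(next, RULES[i].rhs);
        strcat(next, form + 1);
        if (derive(next, input)) return 1;
    }
    return 0;
}

int accepts(const char *input) {
    assert(input != NULL);
    return derive("S", input);
}
```

With these rules, accepts("accd") should succeed only after the attempts through S → aAd have been abandoned, mirroring the trace on the slide.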


LL Parsing

• Left-to-right scanning
• Generating a Left parse
• Deterministic parsing
– The production rule to apply is selected deterministically by checking the next input symbol
– Preparation for LL parsing
  • Analyze the grammar to figure out which production rules can be applied for each symbol in the input string
  • If two or more rules are applicable for one symbol, LL parsing is impossible
• No backtracking (the advantage of LL parsing)
– If the current input symbol and the generated terminal symbol are not the same, the LL parser concludes that the input string does not conform to the grammar
– Faster than plain top-down approaches

LL Parsing (cont'd)

• LL parsing
– confines the grammar to well-defined ones
• Two kinds of technical efforts in the LL parsing area
– Rewriting a grammar to make it suitable for LL parsing
– Analyzing the grammar to extract as much useful information as possible

LL(1) and LOOKAHEAD(...)

• LOOKAHEAD
– The lookahead set for a nonterminal X in a grammar X → A | B | C is the set of terminals with which the strings produced from X can begin
– LOOKAHEAD(A → α) = FIRST({ ω | S ⇒* μAβ ⇒ μαβ ⇒* μω ∈ T* })
• Strong LL(1) condition
– For any rules A → α | β: LOOKAHEAD(A → α) ∩ LOOKAHEAD(A → β) = ∅
– In other words, “LOOKAHEAD is unique”
• LL grammar?
– “a context-free grammar satisfying the LL condition”
• So,
– Create an LL grammar from a given arbitrary context-free grammar: ambiguity elimination and left-factoring (for general ambiguities) + left-recursion elimination (only for LL-related ambiguities)
– Then, parse with an LL parser
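Once the lookahead set of every rule is known, the strong LL(1) condition is a pairwise disjointness test. The sketch below assumes the sets have already been computed by hand and encodes them as bit masks over the terminals 'a' to 'z'; the TermSet and RuleInfo types and the function name are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* A set of terminals as a bit mask; TERM('a') .. TERM('z') are the members. */
typedef unsigned long TermSet;
#define TERM(c) (1UL << ((c) - 'a'))

typedef struct {
    char    lhs;        /* nonterminal on the left-hand side */
    TermSet lookahead;  /* LOOKAHEAD of this rule, computed beforehand */
} RuleInfo;

/* Returns 1 if every pair of rules with the same left-hand side has
   disjoint lookahead sets (the strong LL(1) condition), 0 otherwise. */
int is_strong_ll1(const RuleInfo *rules, size_t n) {
    assert(rules != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        for (size_t j = i + 1; j < n; j++)
            if (rules[i].lhs == rules[j].lhs &&
                (rules[i].lookahead & rules[j].lookahead) != 0)
                return 0;
    return 1;
}
```

For example, the rules S → aS | bA, A → d | ccA have the lookahead sets {a}, {b}, {d}, {c} and pass the test, while E → E + T | T with T → num fails it, because both E rules begin with num (written as the terminal n in this encoding).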


1. Elimination of Ambiguities

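Basic idea) Use a different grammar that generates the same language

• e.g. Giving different priorities to different operators
  E → E+E | E*E | id      (ambiguous)
  becomes
  E → E + T | T
  T → T * F | F
  F → id
  – Introduce a separate nonterminal for each priority level
  – The nonterminal for the operator that should be processed last is the closest to the start symbol

• e.g. Applying associativity (input 1 + 2 + 3 + 4)
  Right-recursive rules group to the right:
  E → T + E
  E → T
  T → num
  tree: 1 + (2 + (3 + 4))

  Left-recursive rules group to the left (left associativity):
  E → E + T
  E → T
  T → num
  tree: ((1 + 2) + 3) + 4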

3. Left-Recursion

Grammar: E → E + T, E → T, T → num
Input: “1 + 2 + 3 + 4”

derived string    lookahead    input (read/unread)
E                 1            1+2+3+4
E+T               1            1+2+3+4
E+T+T             1            1+2+3+4
E+T+T+T           1            1+2+3+4
T+T+T+T           1            1+2+3+4
1+T+T+T           2            1+2+3+4
1+2+T+T           3            1+2+3+4
1+2+3+T           4            1+2+3+4
1+2+3+4           $            1+2+3+4

Q) Is this alright?

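Left recursion may cause an infinite loop in LL parsing

Left-recursion elimination: rewriting left recursion into right recursion

E → E + T              E → T E’
E → T          ⟹       E’ → + T E’ | ε
T → num                T → num

[Figure: the left-leaning parse tree of the left-recursive grammar next to the right-leaning parse tree of the rewritten grammar]

In general: A → Aα | β   ⟹   A → βA´,  A´ → αA´ | ε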
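Why the rewritten grammar is safe for top-down parsing can be seen in a small recognizer. This is a sketch under assumptions: num is restricted to a single digit, the input contains no spaces, and the names parse_E, parse_Eprime, parse_T and parse_sum are hypothetical. Every procedure consumes a terminal before it recurses, so the recursion is bounded by the input length; a procedure written directly from E → E + T would call itself before consuming anything and would never return.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

static const char *cur;   /* next unread input character (the lookahead) */

static int parse_T(void) {                /* T -> num (a single digit here) */
    if (!isdigit((unsigned char)*cur)) return 0;
    cur++;
    return 1;
}

static int parse_Eprime(void) {           /* E' -> + T E' | epsilon */
    if (*cur == '+') {
        cur++;
        return parse_T() && parse_Eprime();
    }
    return 1;                             /* epsilon: consume nothing */
}

static int parse_E(void) {                /* E -> T E' */
    return parse_T() && parse_Eprime();
}

/* Returns 1 if the whole string is a sentence of the grammar. */
int parse_sum(const char *input) {
    assert(input != NULL);
    cur = input;
    return parse_E() && *cur == '\0';
}
```

Under these assumptions parse_sum("1+2+3+4") should accept, and inputs such as "1+" or "+1" should be rejected.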


Implementation of LL(1) Parsers

1. Recursive descent parser
– Uses recursion
– One procedure is written for each nonterminal
– Advantage: intuitive and easy
– Disadvantage: any modification of the production rules entails a modification of the parser
2. Predictive parser
– Theoretically based on the PDA (pushdown automaton)
– No need to change the parser even when the production rules are changed
  • Only the parsing table needs to be modified

1. Recursive Descent Parser Example

• For each terminal symbol a,
procedure pa;
begin
  if nextSymbol = ta then get_nextSymbol else error
end; /*pa*/

• For each nonterminal symbol A,
procedure pA;
begin
  case nextSymbol of
    LOOKAHEAD(A → X1X2...Xm): for i:=1 to m do pXi;
    LOOKAHEAD(A → Y1Y2...Yn): for i:=1 to n do pYi;
    LOOKAHEAD(A → Z1Z2...Zr): for i:=1 to r do pZi;
    LOOKAHEAD(A → ε): ;
    otherwise : error
  end
end; /*pA*/

(1) Terminal symbols

Grammar: S → aAb,  A → aS | b

/* Match terminal a: consume it, or report a syntax error. */
void pa() {
    if (nextSymbol == ta)
        nextSymbol = get_nextSymbol();
    else
        error();
}

/* Match terminal b: consume it, or report a syntax error. */
void pb() {
    if (nextSymbol == tb)
        nextSymbol = get_nextSymbol();
    else
        error();
}

(2) Non-terminal symbols

Grammar: S → aAb,  A → aS | b

/* S -> aAb: the only rule for S starts with a. */
void pS() {
    if (nextSymbol == ta) {
        pa(); pA(); pb();
    } else {
        error();
    }
}

/* A -> aS | b: the lookahead selects the rule. */
void pA() {
    switch (nextSymbol) {
    case ta: pa(); pS(); break;
    case tb: pb(); break;
    default: error();
    }
}
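The procedures for S → aAb, A → aS | b can be assembled into one compilable unit. The slides leave nextSymbol, get_nextSymbol, ta, tb and error undefined, so the definitions below are assumptions for illustration: tokens are plain characters, error() only sets a flag instead of aborting, and the driver parse() is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum { ta = 'a', tb = 'b', tEnd = '\0' };  /* token codes (assumed) */

static const char *input;   /* remaining input */
static int nextSymbol;      /* current lookahead token */
static int failed;          /* set by error(), never cleared while parsing */

static int  get_nextSymbol(void) { return *input ? *input++ : tEnd; }
static void error(void)          { failed = 1; }

static void pS(void);

static void pa(void) {
    if (nextSymbol == ta) nextSymbol = get_nextSymbol(); else error();
}
static void pb(void) {
    if (nextSymbol == tb) nextSymbol = get_nextSymbol(); else error();
}

static void pA(void) {                 /* A -> aS | b */
    if (failed) return;
    switch (nextSymbol) {
    case ta: pa(); pS(); break;
    case tb: pb(); break;
    default: error();
    }
}

static void pS(void) {                 /* S -> aAb */
    if (failed) return;
    if (nextSymbol == ta) { pa(); pA(); pb(); }
    else error();
}

/* Returns 1 if `text` is a sentence of the grammar, 0 otherwise. */
int parse(const char *text) {
    assert(text != NULL);
    input = text;
    failed = 0;
    nextSymbol = get_nextSymbol();
    pS();
    return !failed && nextSymbol == tEnd;
}
```

The final check on tEnd matters: without it, a valid prefix followed by leftover input (such as abba) would be accepted.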

Example: ANTLR

• ANTLR (ANother Tool for Language Recognition)
– http://www.antlr.org/
– Parser/lexer generator: takes a grammar and generates an LL(k) lexer and/or parser
– Written in Java, open-source software
– The successor to the Purdue Compiler Construction Tool Set (PCCTS)
– Can generate Java, C#, C++, Python, …

[Figure: JavaGrammar.g in ANTLR]

2. Predictive Parser

• Build a table!
– In LL(1), the rule to apply is uniquely determined by the given lookahead
– Thus, a Nonterminal × Lookahead table enables mechanical parsing
– e.g. 1. S → aS   2. S → bA
       3. A → d    4. A → ccA

      a    b    c    d
S     1    2
A               4    3

Example: parsing aabccd applies rules 1, 1, 2, 4, 3:
S ⇒ aS ⇒ aaS ⇒ aabA ⇒ aabccA ⇒ aabccd
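A table-driven parser for the grammar S → aS | bA, A → d | ccA can be sketched with an explicit stack. The code is an illustration under assumptions: symbols are single characters, the table is written as a small lookup function, and the names table and ll1_parse are hypothetical. Changing the grammar would only require changing RHS and table, not the parsing loop.

```c
#include <assert.h>
#include <string.h>

/* Right-hand sides of rules 1..4 (index 0 unused). */
static const char *RHS[] = { NULL, "aS", "bA", "d", "ccA" };

/* Parsing table: rule number for (nonterminal, lookahead), 0 = error. */
static int table(char nonterminal, char lookahead) {
    if (nonterminal == 'S')
        return lookahead == 'a' ? 1 : lookahead == 'b' ? 2 : 0;
    if (nonterminal == 'A')
        return lookahead == 'd' ? 3 : lookahead == 'c' ? 4 : 0;
    return 0;
}

/* Parses `input` with an explicit stack. On success returns 1 and writes
   the left parse (rule numbers as digits) into `out`; returns 0 on a
   syntax error or if a buffer of this sketch would overflow. */
int ll1_parse(const char *input, char *out, size_t out_cap) {
    assert(input != NULL && out != NULL);
    char stack[128];
    size_t top = 0, n_out = 0;
    stack[top++] = 'S';                       /* start symbol */

    while (top > 0) {
        char x = stack[--top];                /* pop */
        if (x == 'S' || x == 'A') {           /* nonterminal: expand */
            int rule = table(x, *input);
            if (rule == 0) return 0;
            size_t len = strlen(RHS[rule]);
            if (top + len > sizeof stack || n_out + 2 > out_cap) return 0;
            for (size_t i = len; i > 0; i--)  /* push the rhs reversed */
                stack[top++] = RHS[rule][i - 1];
            out[n_out++] = (char)('0' + rule);
        } else {                              /* terminal: must match */
            if (x != *input) return 0;
            input++;
        }
    }
    out[n_out] = '\0';
    return *input == '\0';
}
```

For the input aabccd the function should report the left parse 11243, the same rule sequence as in the slide's example.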

Syntax Analysis II. Bottom-Up

Bottom-Up Parsing

• Bottom-up = right parse = reverse order of the rightmost derivation
– starting from the terminal symbols, generating the start symbol
– check whether the beginning of the current string matches the rhs of a production rule
  • if it matches, replace it with the lhs

Grammar: E → E + T | T,  T → num | (E)

(1+2+(3+4))+5 → (T+2+(3+4))+5
→ (E+2+(3+4))+5 → (E+T+(3+4))+5
→ (E+(3+4))+5 → (E+(T+4))+5 → (E+(E+4))+5
→ (E+(E+T))+5 → (E+(E))+5 → (E+T)+5 → (E)+5
→ T+5 → E+5 → E+T → E

Top-down vs. Bottom-up

[Figure: scanned vs. unscanned portions of the input for top-down and bottom-up parsing]

• Bottom-up parsing is more powerful
• The selection of a production rule can be put off until more tokens have been read
  e.g. a left-recursive grammar can be parsed by a bottom-up parser

Terms : LL, LR

• LL(k)
– scanning the input Left-to-right
– Leftmost derivation
– looking ahead k symbols
– [top-down or predictive] parsing, or LL parser
– traverses and creates the parse tree in pre-order
• LR(k)
– scanning the input Left-to-right
– Rightmost derivation
– looking ahead k symbols
– [bottom-up or shift-reduce] parsing, or LR parser
– traverses and creates the parse tree in post-order

What is Reduce?

• When S ⇒* αβω and there exists a production rule of the form A → β, a reduction is the substitution of β with A in αβω
• Parsing is done by reducing the input string until we get the start symbol
• e.g.
  1. S → aAcBe   2. A → Ab
  3. A → b       4. B → d

(1) Reduce sequence
abbcde → aAbcde  (reduce 3)
       → aAcde   (reduce 2)
       → aAcBe   (reduce 4)
       → S       (reduce 1)

(2) Parse tree

            S
    /   /   |   \   \
   a   A    c    B   e
      / \        |
     A   b       d
     |
     b

Handle

• The portion of a string that is to be reduced
– If there exists a derivation S ⇒* αAω ⇒ αβω, then β is said to be a handle of αβω

e.g. 1. E → E + T   2. E → T   3. T → T * F
     4. T → F       5. F → ( E )   6. F → a

Derivation                          Reduce
E ⇒ E + T       (1)                 a + a * a → F + a * a   (6)
  ⇒ E + T * F   (13)                          → T + a * a   (64)
  ⇒ E + T * a   (136)                         → E + a * a   (642)
  ⇒ E + F * a   (1364)                        → E + F * a   (6426)
  ⇒ E + a * a   (13646)                       → E + T * a   (64264)
  ⇒ T + a * a   (136462)                      → E + T * F   (642646)
  ⇒ F + a * a   (1364624)                     → E + T       (6426463)
  ⇒ a + a * a   (13646246)                    → E           (64264631)

Implementation of Parsers

• Shift-reduce parsing

– Uses a stack and a special table
– Operations
  • “Shift”: move input symbols onto the stack until a handle appears on top of the stack
  • “Reduce”: for a given handle, determine a production rule and replace the handle with the lhs of the rule
  • Repeat these operations until only the start symbol is left on the stack
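The shift and reduce operations can be sketched for the small grammar 1. S → aAcBe, 2. A → Ab, 3. A → b, 4. B → d. This is a deliberately naive illustration, not an LR parser: there is no parsing table, a reduction is made whenever some right-hand side lies on top of the stack, and longer right-hand sides are tried first. That greedy rule happens to find the handles for this grammar but is not correct in general. The function name shift_reduce is hypothetical.

```c
#include <assert.h>
#include <string.h>

typedef struct { char lhs; const char *rhs; int number; } Rule;

/* Ordered so that longer right-hand sides are tried first
   (A -> Ab must win over A -> b when both match the stack top). */
static const Rule RULES[] = {
    {'S', "aAcBe", 1}, {'A', "Ab", 2}, {'A', "b", 3}, {'B', "d", 4},
};
enum { NUM_RULES = sizeof RULES / sizeof RULES[0] };

/* Naive shift-reduce recognizer. Writes the numbers of the applied rules
   (the right parse) into `out`. Returns 1 if the input reduces to S. */
int shift_reduce(const char *input, char *out, size_t out_cap) {
    assert(input != NULL && out != NULL && out_cap > 0);
    char stack[128];
    size_t top = 0, n_out = 0;

    for (;;) {
        int reduced = 0;
        for (int i = 0; i < NUM_RULES && !reduced; i++) {
            size_t len = strlen(RULES[i].rhs);
            if (len <= top &&
                memcmp(stack + top - len, RULES[i].rhs, len) == 0) {
                if (n_out + 1 >= out_cap) return 0;
                top -= len;                       /* pop the handle */
                stack[top++] = RULES[i].lhs;      /* push the lhs   */
                out[n_out++] = (char)('0' + RULES[i].number);
                reduced = 1;
            }
        }
        if (reduced) continue;
        if (*input == '\0') break;                /* nothing left to shift */
        if (top == sizeof stack) return 0;
        stack[top++] = *input++;                  /* shift */
    }
    out[n_out] = '\0';
    return top == 1 && stack[0] == 'S';
}
```

On the input abbcde the reductions should come out in the order 3, 2, 4, 1. A real LR parser replaces the greedy matching with parser states and a table so that it reduces only when the matched string really is a handle.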


Actions in Shift-Reduce Parsing

• Shift: move the lookahead token a onto the stack
– push a
• Reduce: replace the handle β on top of the stack with the nonterminal symbol X (when the production rule is X → β)
– pop β, push X

stack      input            action
(          1+2+(3+4))+5     shift 1
(1         +2+(3+4))+5

stack      input            action
(E+T       +(3+4))+5        reduce E → E + T
(E         +(3+4))+5

Eg. Shift-Reduce Parsing

1. E → E + T   2. E → T   3. T → T * F
4. T → F       5. F → ( E )   6. F → a

(1) reduce process:
a + a * a → F + a * a (6) → T + a * a (64) → E + a * a (642) → E + F * a (6426) → E + T * a (64264) → E + T * F (642646) → E + T (6426463) → E (64264631)

reduce sequence: 64264631

Multiple Candidate Handles

• What if there is more than one candidate handle? ..... “ambiguity, conflict ..”

e.g. grammar: E → E + E | E * E | ( E ) | id
     input: id + id * id

E ⇒ E + E                 E ⇒ E * E
  ⇒ E + E * E               ⇒ E * id
  ⇒ E + E * id              ⇒ E + E * id
  ⇒ E + id * id             ⇒ E + id * id
  ⇒ id + id * id            ⇒ id + id * id

Conflict Resolution

• By defining priorities
– reduce/reduce: compare the priorities of the production rules and choose the one with the higher priority
– shift/reduce: compare the priority of the production rule to reduce with that of the input token
  • reduce: the rule has a higher priority than the input token
  • shift: the input token has a higher priority than the rule

e.g. E → E + E | E * E | num | (E)
Priorities (to process the * operation first)
1: E * E, *
2: E + E, +
Input 1 + 2 * 3: a shift/reduce conflict arises once E + E is on the stack and * is the input token; * has the higher priority, so shift

By defining priorities (cont'd): enforcing associativity by priorities

– Left associativity: reduce first
  • the rule has a higher priority than the input token
– Right associativity: shift first
  • the input token has a higher priority than the rule

e.g. E → E + E, E → num, input 1 + 2 + 3
  shift: 1 + (2 + 3)
  reduce: (1 + 2) + 3

Priorities (for left associativity)
1: E + E
2: +

More convenient method? Yes, in parser generation tools
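How priorities settle shift/reduce conflicts can be sketched for E → E + E | E * E | num | (E). The sketch makes several assumptions: num is a single digit, the input has no spaces, and the names prio and parse_expr are hypothetical. It is essentially operator-precedence parsing rather than a table-driven LR parser. When E op E is on top of the stack, it reduces if the rule's operator has priority greater than or equal to the lookahead's (equal priority means left associativity) and shifts otherwise. Writing operands and operators out in reduction order gives postfix notation, which makes the chosen grouping visible.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Priority of an operator token; 0 for every other token. */
static int prio(char c) { return c == '*' ? 2 : c == '+' ? 1 : 0; }

/* Shift-reduce parser for E -> E+E | E*E | num | (E), num = one digit.
   Writes operands and operators in reduction order (postfix) to `out`.
   Returns 1 if the input is accepted, 0 otherwise. */
int parse_expr(const char *input, char *out, size_t out_cap) {
    assert(input != NULL && out != NULL && out_cap > 0);
    char st[128];
    size_t top = 0, n = 0;

    for (;;) {
        char la = *input;                               /* lookahead token */
        if (n + 1 >= out_cap) return 0;
        if (top >= 1 && isdigit((unsigned char)st[top - 1])) {
            out[n++] = st[top - 1];                     /* reduce E -> num */
            st[top - 1] = 'E';
        } else if (top >= 3 && st[top - 3] == '(' && st[top - 2] == 'E'
                   && st[top - 1] == ')') {
            top -= 2;                                   /* reduce E -> (E) */
            st[top - 1] = 'E';
        } else if (top >= 3 && st[top - 3] == 'E' && prio(st[top - 2]) > 0
                   && st[top - 1] == 'E'
                   && prio(st[top - 2]) >= prio(la)) {
            out[n++] = st[top - 2];                     /* reduce E -> E op E */
            top -= 2;
        } else if (la != '\0') {
            if (top == sizeof st) return 0;
            st[top++] = la;                             /* shift */
            input++;
        } else {
            break;                                      /* nothing left to do */
        }
    }
    out[n] = '\0';
    return top == 1 && st[0] == 'E';
}
```

For 1+2*3 the parser should shift * instead of reducing E + E, giving the postfix form 123*+; for 1+2+3 it should reduce first, giving 12+3+, i.e. (1+2)+3. Parser generators offer the same mechanism declaratively through precedence and associativity declarations.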

Conflict Resolution (cont'd)

• But how can conflicts be resolved in general?
– When the stack holds α and the input token is b, and we have the rule X → γ where α = βγ
  • Should we make the stack βX by reducing with X → γ?
  • or αb by pushing b?
– When the stack holds α = βγ and also α = β’γ’, and we have both X → γ and X’ → γ’
  • Which one should be the next handle, γ or γ’?
• Solution
– Using “parser states”: guiding information for selecting actions and handles
– Far fewer conflicts

LR Parser

• LR(k)
– Left-to-right scan, Rightmost derivation, k lookahead symbols
– Basics: LR(0), LR(1), ...
– Variations: SLR, ..., LALR(1)
• Example
– Yacc (recall “lex & yacc”) … we will see it in the next chapter

Categories of Syntax Analysis Methods

[Figure: nested grammar classes LR(0), SLR, LALR(1), LR(1), together with LL(1)]

LR(k) ⊂ LR(k+1)
LL(k) ⊂ LL(k+1)
LL(k) ⊂ LR(k)
LR(0) ⊂ SLR ⊂ LALR(1) ⊂ LR(1)
