-
1
Advanced Compilers: Syntax Analysis
Fall 2017
Chungnam National Univ.
Eun-Sun Cho
-
2
Compiler Front-End Structure
Source code
→ Preprocessor: trivial processing (#include, #define, #ifdef ...)
→ preprocessed source code
→ Lexical Analysis → Syntax Analysis → Semantic Analysis
→ abstract syntax tree
Errors are reported by the analysis phases.
-
3
Scanner (lexical analyzer) vs. Parser
Source Program → Lexical Analyzer → token → Parser
Parser → "get token" request → Lexical Analyzer
The parser usually invokes a function (e.g. scanner()) to get the next token.
-
4
Syntax Analysis Related Questions
1) How to describe the syntax?
• when creating a programming language
2) How to determine if the input token stream satisfies the syntax description?
• to check whether a given input token stream is correct according to the described grammar
-
5
(1) How to Describe The Syntax
• CFG (Context Free Grammar)
– A widely used way to define a grammar of a language
– Simple and straightforward
– A recognizer is easy to implement, even automatically
• G = (V, T, P, S)
– V : the set of nonterminal symbols (intermediate symbols)
– T : the set of terminal symbols
– P : the set of production rules of the form N → α,
where N ∈ V and α ∈ (V ∪ T)*
– S : the start symbol
• L(G) : the language generated by the grammar G
-
6
• Common notations for Grammatical Symbols
– Terminal symbols (T)
• first letters of the alphabet in lowercase, like a, b and c, and digits
0, 1, 2, … , 9
• operators, e.g. + –
• delimiters, e.g. ; , ( )
• symbols between ‘ and ’, e.g. ‘if’ ‘then’
– Nonterminal symbols (V)
• first letters of the alphabet in uppercase, like A, B and C
• S usually denotes the start symbol (but not necessarily)
• names enclosed in < and >, like <id> and <letter>
-
7
– Production rules (P)
• e.g.
S → T + T
T → ‘0’
T → ‘1’
T → ‘2’
• When several rules have the same nonterminal on the left-hand side, we can
merge them into a single rule sharing that left-hand side;
that is, A → α1, A → α2, …, A → αk
become A → α1 | α2 | … | αk
e.g. T → ‘0’ | ‘1’ | ‘2’
– Start symbol (S)
• By default, the nonterminal on the left-hand side of the first production
rule is the start symbol.
-
8
Various CFG Notations
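• BNF (Backus-Naur Form)
– nonterminal symbols : enclosed in < and >
– terminal symbols : character strings
<id> ::= <letter> | <id><letter> | <id><digit>
<letter> ::= a | b | c | … | y | z
<digit> ::= 0 | 1 | 2 | … | 8 | 9
• EBNF (Extended BNF)
– makes use of meta symbols
• to denote repetition or option concisely
<id> ::= <letter> { <letter> | <digit> } (the braces repeat 0 to 7 times)
<letter> ::= a | b | c | … | y | z
<digit> ::= 0 | 1 | 2 | … | 9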
-
Example: an ANTLR grammar (a kind of EBNF)
grammar MiniC;
program : decl+;
decl : var_decl
| fun_decl ;
var_decl : type_spec IDENT ';'
| type_spec IDENT '[' ']' ';' ;
type_spec : VOID
| INT ;
fun_decl : type_spec IDENT '(' params ')' compound_stmt;
params : param (',' param)*
| VOID ;
param : type_spec IDENT
| type_spec IDENT '[' ']' ;
...
-
10
(2) How to determine if the input token stream satisfies the syntax description
• Grammars and languages
– grammar: production rules starting from S
– language: sentences such as “Smart students have gathered.”
• Derivation
– to generate a sentence of the language from the grammar
• Syntax analysis
– to recover the grammatical structure of a sentence in the language
– How? By checking whether a derivation path exists
-
11
Derivation
• Derivation
– goal: generating a sentence of the language from the grammar
– how-to: expanding nonterminals by applying production rules in sequence
– e.g.
grammar: E → E + E | E * E | (E) | a
sentence: (a+a)
derivation: E ⇒ (E) ⇒ (E+E) ⇒ (a+E) ⇒ (a+a)
• Derivation tree
– an abstraction of a derivation path
      E
    / | \
   (  E  )
    / | \
   E  +  E
   |     |
   a     a
-
12
Order of Substitution
More than one nonterminal may appear on the right-hand side of ‘⇒’
– Leftmost derivation: substitute the leftmost nonterminal first
– Rightmost derivation: substitute the rightmost nonterminal first
Grammar
1. E → E+E
2. E → E*E
3. E → (E)
4. E → a
Leftmost Derivation (rule applied)
E ⇒ E*E (2)
⇒ (E)*E (3)
⇒ (E+E)*E (1)
⇒ (a+E)*E (4)
⇒ (a+a)*E (4)
⇒ (a+a)*a (4)
Rightmost Derivation (rule applied)
E ⇒ E*E (2)
⇒ E*a (4)
⇒ (E)*a (3)
⇒ (E+E)*a (1)
⇒ (E+a)*a (4)
⇒ (a+a)*a (4)
Q) What about derivation trees?
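The two substitution orders can be replayed mechanically. The following is a minimal sketch, assuming sentential forms are plain C strings in which the letter E is the only nonterminal; the names RHS, apply_rule and derive are illustrative and not part of the slides.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Right-hand sides of rules 1..4 for E -> E+E | E*E | (E) | a. */
const char *RHS[] = { "", "E+E", "E*E", "(E)", "a" };

/* Replace the leftmost (leftmost != 0) or rightmost 'E' in form with the
   right-hand side of the given rule. Returns 0 if no 'E' is left or the
   result would not fit in cap bytes. */
int apply_rule(char *form, size_t cap, int rule, int leftmost)
{
    assert(rule >= 1 && rule <= 4);
    char *pos = leftmost ? strchr(form, 'E') : strrchr(form, 'E');
    if (pos == NULL) return 0;
    size_t rhs_len = strlen(RHS[rule]);
    if (strlen(form) - 1 + rhs_len + 1 > cap) return 0;
    /* Move the tail (including the terminating '\0') to make room. */
    memmove(pos + rhs_len, pos + 1, strlen(pos + 1) + 1);
    memcpy(pos, RHS[rule], rhs_len);
    return 1;
}

/* Start from "E" and apply the rules in order, printing each step.
   The final sentential form is left in out. */
int derive(const int *rules, int n, int leftmost, char *out, size_t cap)
{
    if (cap < 2) return 0;
    strcpy(out, "E");
    for (int i = 0; i < n; i++) {
        if (!apply_rule(out, cap, rules[i], leftmost)) return 0;
        printf("=> %s\n", out);
    }
    return 1;
}
```

Calling derive with the rule list 2, 3, 1, 4, 4, 4 and leftmost set to 1 replays the leftmost derivation above; the list 2, 4, 3, 1, 4, 4 with leftmost set to 0 replays the rightmost one.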
-
13
Syntax Analysis
• Syntax analysis (= Parsing)
– The process of determining whether a given string can be generated by the defined grammar
• i.e., checking whether the string can be derived from the grammar
– If the statement is correct → output its syntactic structure
– If the statement is wrong → output an error message
• Syntax analyzer (= Parser)
Source Program → Scanner → a series of tokens → Parser → Syntactic Structure → IL Generator → IL (Intermediate Code)
-
14
Data Structures for Syntactic Structures
• Parse tree
– A tree for a syntactic structure
– Same as the derivation tree
• created while grammar rules are applied during the derivation
• root node: the start symbol of the grammar
• intermediate nodes: left-hand-side nonterminals of the applied grammar
rules
• leaf nodes: the terminal symbols that make up the given string
• List of production-rule numbers
– the series of production-rule numbers applied during the
derivation
-
15
Two Different Approaches of Syntactic Analysis
• Top-down approach
– A parse tree is built by extending the root node down to the terminal
nodes
• Bottom-up approach
– A parse tree is built up from the terminal nodes to the
root node
-
16
Example: (a+a)
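1. E → E + E   2. E → E * E   3. E → ( E )
4. E → -E   5. E → a
1) Top-down Parsing
[Figure: the parse tree for (a+a) grown step by step from the root E down to the leaves ( a + a )]
2) Bottom-up Parsing
[Figure: the parse tree for (a+a) built step by step from the leaves ( a + a ) up to the root E]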
-
17
Two different approaches of syntactic analysis (more)
– Top-down approach
• rules are applied in the same order as in the leftmost derivation
• “left parse”: the list of production-rule numbers in the leftmost
derivation
– Bottom-up approach
• rules are applied in the reverse order of the rightmost derivation!
– Note that the bottom-up approach still reads the input string from left
to right: the rightmost derivation expands the right side of the parse
tree first, so its reverse reduces the leftmost terminals first
• “right parse”: the reversed list of production-rule numbers in the
rightmost derivation
-
18
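e.g. 1. Input: abac
1. S → XY   2. X → aX   3. X → b   4. Y → aY   5. Y → c
Parse tree (the same for both approaches):
        S
      /   \
     X     Y
    / \   / \
   a   X a   Y
       |     |
       b     c
Top-down (leftmost derivation): S ⇒ XY ⇒ aXY ⇒ abY ⇒ abaY ⇒ abac
left parse : 12345
Bottom-up (reverse of the rightmost derivation): S ⇒ XY ⇒ XaY ⇒ Xac ⇒ aXac ⇒ abac
right parse : 32541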
-
Syntax Analysis I. Top-Down
19
-
20
Top-down Approach
• The more intuitive approach (similar to the leftmost derivation process)
• First,
– Derive a string by applying the first production rule of the start
symbol
• Then,
– Compare the derived string with the input string, one character at a
time
– If the next character of the derived string and that of the input string
are the same, keep proceeding
– If the next symbol of the derived string is a nonterminal, expand it with
the first rule of that nonterminal, and go on comparing the characters of
the derived string and the input string (recursively)
– If they differ? … backtracking!
-
21
Backtracking
• If the compared characters differ
– Assume that the most recently applied rule was wrong;
roll back and apply another candidate rule instead
– If the derivation with the new rule also fails, choose yet another rule
and repeat the steps above.
– If every candidate fails and no untried rule is left, conclude that
the input string is wrong
-
22
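e.g.
Is the input string accd correct?
1. S → aAd   2. S → aB
3. A → b   4. A → c
5. B → ccd   6. B → ddc
First try S → aAd: a matches; A → b fails (✗) because the next input is c; A → c matches, but then d does not match the next input c (✗)
Backtrack and try S → aB: a matches; B → ccd matches the rest of the input, so accd is accepted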
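The backtracking search for this grammar can be sketched as a small recursive recognizer. This is only an illustration under stated assumptions: uppercase letters stand for nonterminals, every other character is a terminal, and the names RULES and derives are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Rules 1..6 of the example grammar, tried in this order. */
struct rule { char lhs; const char *rhs; };
const struct rule RULES[] = {
    {'S', "aAd"}, {'S', "aB"},
    {'A', "b"},   {'A', "c"},
    {'B', "ccd"}, {'B', "ddc"},
};
enum { NRULES = sizeof RULES / sizeof RULES[0] };

/* Can the sentential form `form` derive exactly `input`? */
int derives(const char *form, const char *input)
{
    assert(form != NULL && input != NULL);
    if (*form == '\0') return *input == '\0';
    if (*form < 'A' || *form > 'Z')        /* terminal: must match the input */
        return *form == *input && derives(form + 1, input + 1);

    for (int i = 0; i < NRULES; i++) {     /* nonterminal: try each rule */
        if (RULES[i].lhs != *form) continue;
        char next[64];
        if (strlen(RULES[i].rhs) + strlen(form + 1) >= sizeof next) return 0;
        strcpy(next, RULES[i].rhs);
        strcat(next, form + 1);
        if (derives(next, input)) {
            /* Printed while returning, so innermost rules appear first. */
            printf("rule %d: %c -> %s\n", i + 1, RULES[i].lhs, RULES[i].rhs);
            return 1;
        }
        /* Otherwise backtrack: fall through and try the next candidate. */
    }
    return 0;                              /* no untried rule is left */
}
```

For example, derives("S", "accd") first explores rule 1 with both A rules, fails, and then succeeds through rules 2 and 5. The sketch terminates because this grammar has no left recursion.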
-
23
LL Parsing
• Left-to-right scanning
• Generating a Left parse
• Deterministic parsing
– The production rule to apply is selected deterministically by checking
the next single character of the input
– Preparation for LL parsing
• Analyze the grammar to figure out which production rule can be applied
for each character of the input string
• If two or more rules are applicable for the same character, LL parsing is
impossible
• No backtracking (the advantage of LL parsing)
– If the current input character and the generated terminal symbol are not the
same, the LL parser concludes that the input string does not conform to the grammar
– Faster execution than plain (backtracking) top-down approaches
-
LL Parsing (cont'd)
• LL Parsing
– works only for a restricted class of well-defined grammars
• Two kinds of technical effort in the LL parsing area
– Rewriting a grammar to make it suitable for LL parsing
– Analyzing the grammar to extract as much useful information as
possible
24
-
25
LL(1) and LOOKAHEAD(...)
• LOOKAHEAD
– For a nonterminal X with rules X → A | B | C, the lookahead set of each rule is the
set of terminals with which the strings produced through that rule can begin
– LOOKAHEAD(A → α)
= FIRST({ ω | S ⇒* μAβ ⇒ μαβ ⇒* μω ∈ V_T* })
• Strong LL(1) condition
For any pair of rules A → α | β, the strong LL(1) condition is: LOOKAHEAD(A → α) ∩ LOOKAHEAD(A → β) = ∅
– In other words,
“LOOKAHEAD is unique”
-
• LL grammar?
– “a context-free grammar satisfying the LL condition”
• So,
– Create an LL grammar from a given arbitrary context-free grammar:
ambiguity elimination, left-factoring (for general ambiguities)
+ left-recursion elimination (only for LL-related ambiguities)
– Then, parse with an LL parser
26
-
27
1. Elimination of Ambiguities
Basic idea: use a different grammar that generates the same language
• e.g. Giving different priorities to different operators
E → E+E | E*E | id    (ambiguous)
becomes
E → E+T | T
T → T*F | F
F → id
• e.g. Applying left associativity
• Make a separate nonterminal for each priority level
• The nonterminal for the operator that should be processed last is
the closest to the start symbol
Right-recursive grammar: E → T + E, E → T, T → num
groups 1 + 2 + 3 + 4 as 1 + (2 + (3 + 4))
Left-recursive grammar: E → E + T, E → T, T → num
groups 1 + 2 + 3 + 4 as ((1 + 2) + 3) + 4
-
28
2. Left Factoring
• Introduce a new nonterminal for what follows the common prefix, and fold the rules
e.g.
A → αβ | αγ
==> A → αA´
A´ → β | γ
e.g.
S → iCtS | iCtSeS | a
C → b
==> S → iCtSS´ | a
S´ → eS | ε
C → b
-
29
3. Left-Recursion
Grammar: E → E + T, E → T, T → num
Input: “1 + 2 + 3 + 4”
derived string   lookahead   input (read | unread)
E                1           | 1+2+3+4
E+T              1           | 1+2+3+4
E+T+T            1           | 1+2+3+4
E+T+T+T          1           | 1+2+3+4
T+T+T+T          1           | 1+2+3+4
1+T+T+T          2           1+ | 2+3+4
1+2+T+T          3           1+2+ | 3+4
1+2+3+T          4           1+2+3+ | 4
1+2+3+4          $           1+2+3+4 |
Q) Is this alright?
-
30
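Left recursion may cause an infinite loop in LL parsing
Left-recursion elimination: rewriting left recursion into right recursion
E → E + T, E → T, T → num
becomes
E → TE’
E’ → +TE’ | ε
T → num
In general: A → Aα | β becomes A → βA´, A´ → αA´ | ε
[Figure: the left-recursive parse tree, which grows down its left edge, next to the right-recursive parse tree with E’ nodes, which grows down its right edge]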
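Once the left recursion is removed, the grammar E → TE’, E’ → +TE’ | ε, T → num can be parsed by simple recursive procedures without looping forever. The sketch below assumes that num is a run of decimal digits and that the input contains no spaces; the function names are illustrative only.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

const char *cur;   /* next unread input character (the lookahead) */

/* T -> num   (num is assumed to be one or more decimal digits) */
int parse_T(void)
{
    if (!isdigit((unsigned char)*cur)) return 0;
    while (isdigit((unsigned char)*cur)) cur++;
    return 1;
}

/* E' -> + T E' | epsilon */
int parse_Eprime(void)
{
    if (*cur == '+') {          /* lookahead '+': apply E' -> + T E' */
        cur++;
        return parse_T() && parse_Eprime();
    }
    return 1;                   /* any other lookahead: apply E' -> epsilon */
}

/* E -> T E' */
int parse_E(void)
{
    return parse_T() && parse_Eprime();
}

/* Returns 1 if the whole input is derivable from E, 0 otherwise. */
int accepts(const char *input)
{
    assert(input != NULL);
    cur = input;
    return parse_E() && *cur == '\0';
}
```

Every call consumes at least one character before recursing, which is exactly what the original rule E → E + T could not guarantee.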
-
31
Implementation of LL(1) Parsers
1. Recursive descent parser
– Uses recursion
– One procedure is written for each nonterminal
– Adv. : intuitive, easy
– Disadv. : any modification of the production rules entails a modification
of the parser
2. Predictive parser
– theoretically based on the PDA (pushdown automaton)
– no need to change the parser even when the production rules are
changed
• Only the parsing table needs to be changed
-
32
1. Recursive Descent Parser Example
• For each terminal symbol a,
procedure pa;
begin
    if nextSymbol = ta then get_nextSymbol else error
end; /*pa*/
• For each nonterminal symbol A,
procedure pA;
begin
    case nextSymbol of
        LOOKAHEAD(A → X1X2...Xm): for i:=1 to m do pXi;
        LOOKAHEAD(A → Y1Y2...Yn): for i:=1 to n do pYi;
        LOOKAHEAD(A → Z1Z2...Zr): for i:=1 to r do pZi;
        LOOKAHEAD(A → ε): ;
        otherwise : error
    end
end; /*pA*/
-
33
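1. Recursive Descent Parser Example
Grammar: S → aAb, A → aS | b
(1) Terminal symbols
/* Match terminal a: consume it if it is the lookahead, otherwise report an error. */
void pa() {
    if (nextSymbol == ta)
        nextSymbol = get_nextSymbol();
    else
        error();
}
/* Match terminal b in the same way. */
void pb() {
    if (nextSymbol == tb)
        nextSymbol = get_nextSymbol();
    else
        error();
}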
-
34
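1. Recursive Descent Parser Example
Grammar: S → aAb, A → aS | b
(2) Nonterminal symbols
/* S → aAb: the only rule for S starts with a. */
void pS() {
    if (nextSymbol == ta) {
        pa(); pA(); pb();
    } else {
        error();
    }
}
/* A → aS | b: choose the rule by the lookahead. */
void pA() {
    switch (nextSymbol) {
    case ta: pa(); pS(); break;
    case tb: pb(); break;
    default: error();
    }
}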
-
Example-ANTLR
• ANTLR(ANother Tool for Language Recognition)
– http://www.antlr.org/
– Parser/lexer generator: takes a grammar and generates an LL(k)
lexer and/or parser
– Written in Java, open-source software
– The successor to the Purdue Compiler Construction Tool
Set (PCCTS)
– Can generate parsers in Java, C#, C++, Python, …
35
-
36
-
37
-
JavaGrammar.g in ANTLR
-
40
2. Predictive Parser
• Build a Table!
– In LL(1), the rule to apply is uniquely
determined by the given lookahead
– Thus, a Nonterminal × Lookahead table enables mechanical parsing
– e.g. 1. S → aS   2. S → bA
3. A → d   4. A → ccA
      a   b   c   d
S     1   2
A             4   3
Parse of the input aabccd (rule applied at each step):
S ⇒ aS (1) ⇒ aaS (1) ⇒ aabA (2) ⇒ aabccA (4) ⇒ aabccd (3)
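The table can drive a parser that needs only a stack and a loop. The following sketch hard-codes the table for the grammar S → aS | bA, A → d | ccA; the names table_lookup and ll1_parse, and the fixed stack size, are assumptions made for illustration.

```c
#include <assert.h>
#include <string.h>

const char *RHS[] = { "", "aS", "bA", "d", "ccA" };   /* rules 1..4 */

/* Parsing table: rule number for (nonterminal, lookahead), 0 = error. */
int table_lookup(char nonterminal, char lookahead)
{
    if (nonterminal == 'S') return lookahead == 'a' ? 1 : lookahead == 'b' ? 2 : 0;
    if (nonterminal == 'A') return lookahead == 'c' ? 4 : lookahead == 'd' ? 3 : 0;
    return 0;
}

/* Returns 1 if input is accepted; the left parse (one digit per applied
   rule) is then stored in parse, which has room for cap bytes. */
int ll1_parse(const char *input, char *parse, size_t cap)
{
    assert(input != NULL && parse != NULL);
    char stack[128];
    size_t top = 0, n = 0;
    stack[top++] = 'S';                        /* start symbol */
    while (top > 0) {
        char x = stack[--top];
        if (x == 'S' || x == 'A') {            /* nonterminal: consult the table */
            int rule = table_lookup(x, *input);
            if (rule == 0 || n + 1 >= cap) return 0;
            parse[n++] = (char)('0' + rule);
            size_t len = strlen(RHS[rule]);
            if (top + len > sizeof stack) return 0;
            for (size_t i = len; i > 0; i--)   /* push the rhs reversed */
                stack[top++] = RHS[rule][i - 1];
        } else {                               /* terminal: must match the input */
            if (x != *input) return 0;
            input++;
        }
    }
    parse[n] = '\0';
    return *input == '\0';
}
```

For the input aabccd the sketch records the left parse 11243. Changing the grammar would only require changing RHS and table_lookup, not the parsing loop.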
-
Syntax Analysis II. Bottom-Up
41
-
42
Bottom-Up Parsing
• Bottom-up = right parse = reverse order of the rightmost derivation
– starting from the terminal symbols, working up to the start symbol
– check whether part of the current string matches the rhs of a
production rule,
• if it matches, replace it with the lhs
Grammar:
E → E + T | T
T → num | (E)
Reduction steps for (1+2+(3+4))+5, in order:
(1+2+(3+4))+5
(T+2+(3+4))+5
(E+2+(3+4))+5
(E+T+(3+4))+5
(E+(3+4))+5
(E+(T+4))+5
(E+(E+4))+5
(E+(E+T))+5
(E+(E))+5
(E+T)+5
(E)+5
T+5
E+5
E+T
E
-
43
Top-down vs. Bottom-up
[Figure: partial parse trees over the scanned and unscanned input, for Top-down and for Bottom-up]
• Bottom-up is more powerful
• The selection of production rules can be put off until more tokens have been read
e.g. a left-recursive grammar can be parsed by a bottom-up parser
-
44
Terms : LL, LR
• LL(k)
– scanning the input Left-to-right
– Leftmost derivation
– looking ahead k symbols
– [Top-down or predictive] parsing, or LL parser
– traversing and creating the parse tree in pre-order
• LR(k)
– scanning the input Left-to-right
– Rightmost derivation (in reverse)
– looking ahead k symbols
– [Bottom-up or shift-reduce] parsing, or LR parser
– traversing and creating the parse tree in post-order
-
45
What is Reduce?
• When S ⇒* αβω and there exists a production rule of the form A → β, a reduction is the substitution of β with A in αβω
• Parsing is done by reducing the input string until we get the start symbol
• e.g.
1. S → aAcBe   2. A → Ab
3. A → b   4. B → d
(1) reduce sequence
abbcde ⇒ aAbcde (reduce 3)
⇒ aAcde (reduce 2)
⇒ aAcBe (reduce 4)
⇒ S (reduce 1)
(2) Parse tree
        S
  / /   |   \ \
 a A    c    B e
  / \        |
 A   b       d
 |
 b
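The reduce sequence above can be replayed with plain string rewriting. This is a deliberately simplified sketch: it replaces the leftmost occurrence of a rule's right-hand side, which happens to coincide with the handle at every step of this example but is not a general way to find handles. The names RULES and reduce_once are illustrative.

```c
#include <assert.h>
#include <string.h>

/* Rules 1..4 of the example grammar (index 0 is unused). */
struct rule { char lhs; const char *rhs; };
const struct rule RULES[] = {
    {0, ""}, {'S', "aAcBe"}, {'A', "Ab"}, {'A', "b"}, {'B', "d"},
};

/* Replace the leftmost occurrence of the rule's rhs in form by its lhs.
   Returns 0 if the rhs does not occur in form. */
int reduce_once(char *form, int rule_no)
{
    assert(form != NULL && rule_no >= 1 && rule_no <= 4);
    const char *rhs = RULES[rule_no].rhs;
    char *pos = strstr(form, rhs);
    if (pos == NULL) return 0;
    size_t len = strlen(rhs);
    *pos = RULES[rule_no].lhs;
    /* Close the gap left by the rest of the rhs (including the '\0'). */
    memmove(pos + 1, pos + len, strlen(pos + len) + 1);
    return 1;
}
```

Starting from a buffer holding abbcde, calling reduce_once with 3, 2, 4 and then 1 reproduces the forms aAbcde, aAcde, aAcBe and S.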
-
46
Handle
• The portion of a string that is to be reduced
– If there exists a path S ⇒* αAω ⇒ αβω, β is said to be a handle of αβω
e.g. 1. E → E + T   2. E → T   3. T → T * F   4. T → F   5. F → ( E )   6. F → a
Derivation (rightmost), with the rules applied so far:
E ⇒ E + T (1)
⇒ E + T * F (13)
⇒ E + T * a (136)
⇒ E + F * a (1364)
⇒ E + a * a (13646)
⇒ T + a * a (136462)
⇒ F + a * a (1364624)
⇒ a + a * a (13646246)
Reduce (each handle is shown in brackets), with the rules applied so far:
[a] + a * a ⇒ F + a * a (6)
[F] + a * a ⇒ T + a * a (64)
[T] + a * a ⇒ E + a * a (642)
E + [a] * a ⇒ E + F * a (6426)
E + [F] * a ⇒ E + T * a (64264)
E + T * [a] ⇒ E + T * F (642646)
E + [T * F] ⇒ E + T (6426463)
[E + T] ⇒ E (64264631)
-
47
Implementation of Parsers
• Shift-reduce parsing
– Using a stack and a special table
– Operations
• “Shift” : moving input symbols to the stack until any handle appears on
top of the stack
• “Reduce” : for a given handle, determining a production rule and
replacing the handle with the lhs of the rule
• Repeat these operations until only the start symbol is left on the stack
-
48
Actions in Shift-Reduce Parsing
• Shift: move the look-ahead token a onto the stack
– push a
• Reduce: replace the handle α on top of the stack with the nonterminal
symbol X (when the production rule is X → α)
– pop α, push X
stack      input           action
(          1+2+(3+4))+5    shift 1
(1         +2+(3+4))+5
stack      input           action
(E+T       +(3+4))+5       reduce E → E + T
(E         +(3+4))+5
-
49
Eg. Shift-Reduce Parsing
1. E → E + T   2. E → T   3. T → T * F
4. T → F   5. F → ( E )   6. F → a
(1) reduce process :
a + a * a ⇒ F + a * a (6)
⇒ T + a * a (64)
⇒ E + a * a (642)
⇒ E + F * a (6426)
⇒ E + T * a (64264)
⇒ E + T * F (642646)
⇒ E + T (6426463)
⇒ E (64264631)
reduce sequence : 64264631
-
50
Multiple Candidate Handles
• What if there are two or more candidate handles? … “ambiguity, conflict …”
e.g. E → E + E | E * E | ( E ) | id, input: id + id * id
Derivation 1:
E ⇒ E + E
⇒ E + E * E
⇒ E + E * id
⇒ E + id * id
⇒ id + id * id
Derivation 2:
E ⇒ E * E
⇒ E * id
⇒ E + E * id
⇒ E + id * id
⇒ id + id * id
-
51
Conflict Resolution
• By defining priority
– reduce/reduce : compare the priorities of the production rules and
choose the one with the higher priority
– shift/reduce : compare the priority of the production rule to reduce with
that of the input token
reduce : the rule has higher priority than the input token
shift : the input token has higher priority than the rule
e.g. E → E + E | E * E | num | (E)
Priorities (to process the *-operation first)
1: E → E * E, *
2: E → E + E, +
Input 1 + 2 * 3: a shift/reduce conflict arises when the stack holds E + E and the next token is *; since * has higher priority than E → E + E, the parser shifts
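Priority-based resolution of shift/reduce conflicts can be sketched as a tiny shift-reduce evaluator for E → E + E | E * E | num | (E). The sketch makes several assumptions not stated on the slide: num is a run of decimal digits, the input has no spaces, equal priorities are resolved by reducing (left associativity), and a shifted num is reduced to E immediately. The names prio and sr_eval are illustrative.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Priority of an operator token; a rule E -> E op E has the priority of op. */
int prio(char op) { return op == '*' ? 2 : op == '+' ? 1 : 0; }

/* Returns 1 and stores the value in *result if the input is accepted. */
int sr_eval(const char *in, long *result)
{
    assert(in != NULL && result != NULL);
    char sym[128];   /* grammar symbols on the stack: 'E', '+', '*', '(', ')' */
    long val[128];   /* value attached to each 'E' */
    int top = 0;     /* number of symbols on the stack */

    for (;;) {
        /* reduce E -> ( E ) */
        if (top >= 3 && sym[top-3] == '(' && sym[top-2] == 'E' && sym[top-1] == ')') {
            val[top-3] = val[top-2];
            sym[top-3] = 'E';
            top -= 2;
            continue;
        }
        /* candidate handle E op E: reduce unless the input token has higher priority */
        if (top >= 3 && sym[top-3] == 'E' && prio(sym[top-2]) > 0 && sym[top-1] == 'E'
            && prio(sym[top-2]) >= prio(*in)) {
            long a = val[top-3], b = val[top-1];
            val[top-3] = sym[top-2] == '*' ? a * b : a + b;
            top -= 2;
            continue;
        }
        if (*in == '\0') break;
        if (top >= 128) return 0;
        if (isdigit((unsigned char)*in)) {          /* shift num, reduce E -> num */
            long v = 0;
            while (isdigit((unsigned char)*in)) v = v * 10 + (*in++ - '0');
            sym[top] = 'E';
            val[top] = v;
            top++;
        } else if (*in == '+' || *in == '*' || *in == '(' || *in == ')') {
            sym[top++] = *in++;                     /* shift */
        } else {
            return 0;                               /* unknown character */
        }
    }
    if (top == 1 && sym[0] == 'E') { *result = val[0]; return 1; }
    return 0;
}
```

On the input 1+2*3 the sketch shifts * while E + E is on the stack, so the multiplication is reduced first; on 1+2+3 it reduces E + E before shifting the second +.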
-
52
By defining priority (cont'd): Enforcing Associativity by Priorities
– Left associativity: reduce first
• the rule has higher priority than the input token
– Right associativity: shift first
• the input token has higher priority than the rule
e.g. E → E + E, E → num, input: 1 + 2 + 3
shift: 1 + (2 + 3)
reduce: (1 + 2) + 3
Priorities (for left associativity)
1: E → E + E
2: +
A more convenient method? Yes, in parser generation tools
-
53
Conflict Resolution (cont'd)
• But, how to resolve general conflicts?
– When the stack holds α and the input token is b, and we have the rule
X → β where α = γβ
• Should we turn the stack into γX by reducing with X → β?
• or into αb by pushing b?
– When the stack holds α = γβ and also α = γ’β’, and we have both X → β and X’ → β’
• Which one should be the next handle, β or β’?
• Solution
– Using “parser states” : guiding information to select actions and
handles
– Far fewer conflicts
-
54
LR Parser
• LR(k)
– Left-to-right scan, Rightmost derivation (in reverse), k lookahead symbols
– Basics: LR(0), LR(1) ...
– Variations : SLR, ... LALR(1)
• Example
– Yacc (recall “lex & yacc”) … we will see it in the next chapter
-
55
Categories of Syntax Analysis Methods
[Figure: nested regions showing the grammar classes LR(0), SLR, LALR(1), LR(1), and LL(1)]
LR(k) ⊂ LR(k+1)
LL(k) ⊂ LL(k+1)
LL(k) ⊂ LR(k)
LR(0) ⊂ SLR ⊂ LALR(1) ⊂ LR(1)