1 Advanced Compilers Syntax Analysis Fall. 2017 Chungnam National Univ. Eun-Sun Cho


  • 1

    Advanced Compilers: Syntax Analysis

    Fall 2017

    Chungnam National Univ.

    Eun-Sun Cho

  • 2

    Compiler Front-End Structure

    Source code
      → Preprocessor: trivial error processing, #include, #define, #ifdef ...
      → preprocessed source code
      → Lexical Analysis → Syntax Analysis → Semantic Analysis
      → abstract syntax tree

    Errors are reported by the lexical, syntax, and semantic analysis phases.

  • 3

    Scanner (lexical analyzer) vs. Parser

    Source program → Lexical analyzer → token → Parser
    Parser → "get token" request → Lexical analyzer

    The parser usually invokes a function (e.g. scanner()) to get the next token.

  • 4

    Syntax Analysis Related Questions

    1) How to describe the syntax?

    • when creating a programming language

    2) How to determine if the input token stream satisfies the syntax description?

    • to check whether a given input token stream is correct according to the described grammar

  • 5

    (1) How to Describe The Syntax

    • CFG (Context Free Grammar)

    – A widely used way to define a grammar of a language

    – Simple and straightforward

    – A recognizer is easy to implement, even automatically

    • G = (V, T, P, S)

    – V : the set of nonterminal symbols (intermediate symbols)

    – T : the set of terminal symbols

    – P : the set of rules of the form N → α,
      where N ∈ V and α ∈ (V ∪ T)*

    – S : the start symbol

    • L(G) : the language generated by the grammar G

  • 6

    • Common notations for grammar symbols

    – Terminal symbols (T)

    • lowercase letters from the beginning of the alphabet, like a, b and c, and the digits 0, 1, 2, …, 9
    • operators, e.g. + –
    • delimiters, e.g. ; , ( )
    • symbols between ‘ and ’, e.g. ‘if’ ‘then’

    – Nonterminal symbols (V)

    • uppercase letters from the beginning of the alphabet, like A, B and C
    • S usually denotes the start symbol (but not necessarily)
    • names enclosed in < and >

  • 7

    – Production rules (P)

    • e.g.
        S → T + T
        T → ‘0’
        T → ‘1’
        T → ‘2’

    • When several rules have the same nonterminal on the left-hand side, they can be merged into a single rule sharing that left-hand side:
        A → α1, A → α2, …, A → αk
      becomes
        A → α1 | α2 | … | αk
      e.g. T → ‘0’ | ‘1’ | ‘2’

    – Start symbol (S)

    • By default, the nonterminal on the left-hand side of the first production rule is the start symbol.

  • 8

    Various CFG Notations

    • BNF (Backus-Naur Form)
    – nonterminal symbols : enclosed in < and >
    – terminal symbols : character strings

        <id>     ::= <letter> | <id><letter> | <id><digit>
        <letter> ::= a | b | c | … | y | z
        <digit>  ::= 0 | 1 | 2 | … | 8 | 9

    • EBNF (Extended BNF)
    – uses meta symbols to denote repetition or option concisely

        <id>     ::= <letter> { <letter> | <digit> }   (the braces repeat 0 to 7 times)
        <letter> ::= a | b | c | … | y | z
        <digit>  ::= 0 | 1 | 2 | … | 9

  • 9

    Example: ANTLR grammar, a kind of EBNF

    grammar MiniC;

    program   : decl+ ;
    decl      : var_decl
              | fun_decl ;
    var_decl  : type_spec IDENT ';'
              | type_spec IDENT '[' ']' ';' ;
    type_spec : VOID
              | INT ;
    fun_decl  : type_spec IDENT '(' params ')' compound_stmt ;
    params    : param (',' param)*
              | VOID ;
    param     : type_spec IDENT
              | type_spec IDENT '[' ']' ;
    ...

  • 10

    (2) How to Determine Whether the Input Token Stream Satisfies the Syntax Description

    • Grammars and languages
    – grammar
        S
    – language
        "The smart students have gathered." (Korean original: “똑똑한 학생들이 모였습니다.”)

    • Derivation
    – generating a sentence of the language from the grammar

    • Syntax analysis
    – building the grammatical structure of a sentence in the language
    – how? By checking whether a derivation path exists

  • 11

    Derivation

    • Derivation
    – goal: generating a sentence of the language from the grammar
    – how-to: expanding nonterminals by applying production rules in sequence
    – e.g.
        grammar:    E → E + E | E * E | (E) | a
        sentence:   (a+a)
        derivation: E ⇒ (E) ⇒ (E+E) ⇒ (a+E) ⇒ (a+a)

    • Derivation tree
    – an abstraction of a derivation path

              E
            / | \
           (  E  )
            / | \
           E  +  E
           |     |
           a     a

  • 12

    Order of Substitution

    More than one nonterminal may appear on the right-hand side of ‘⇒’:
    – Leftmost derivation : substitute the leftmost nonterminal first
    – Rightmost derivation : substitute the rightmost nonterminal first

    Grammar
        1. E → E+E
        2. E → E*E
        3. E → (E)
        4. E → a

    Leftmost derivation    rule        Rightmost derivation    rule
    E ⇒ E*E                2           E ⇒ E*E                 2
      ⇒ (E)*E              3             ⇒ E*a                 4
      ⇒ (E+E)*E            1             ⇒ (E)*a               3
      ⇒ (a+E)*E            4             ⇒ (E+E)*a             1
      ⇒ (a+a)*E            4             ⇒ (E+a)*a             4
      ⇒ (a+a)*a            4             ⇒ (a+a)*a             4

    Q) What about the derivation trees?

  • 13

    Syntax Analysis

    • Syntax analysis (= Parsing)

    – The process of determining whether a given string can be generated by the defined grammar

    • i.e. checking whether the string can be derived from the grammar

    – If the statement is correct → syntactic structure

    – If the statement is wrong → error message

    • Syntax analyzer (= Parser)

        Source program → Scanner → a series of tokens → Parser → syntactic structure → IL generator → IL (intermediate code)

  • 14

    Data Structures for Syntactic Structures

    • Parse tree

    – A tree for a syntactic structure

    – Same as the derivation tree

    • created while grammar rules are applied during the derivation
    • root node: the start symbol of the grammar
    • intermediate nodes: left-hand-side nonterminals of the grammar rules
    • terminal nodes: terminal symbols that make up the given string

    • List of production-rule numbers
    – the series of production-rule numbers applied during the derivation

  • 15

    Two Different Approaches of Syntactic Analysis

    • Top-down approach

    – A parse tree is built by extending the root node down to the terminal nodes

    • Bottom-up approach

    – A parse tree is built from the terminal nodes up to the root node

  • 16

    Example: (a+a)

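    1. E → E + E    2. E → E * E    3. E → ( E )
    4. E → -E       5. E → a

    1) Top-down parsing
        [Figure: the parse tree for (a+a) grown step by step from the root E down to the leaves ( a + a )]

    2) Bottom-up parsing
        [Figure: the same parse tree built step by step from the leaves ( a + a ) up to the root E]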

  • 17

    Two different approaches of syntactic analysis (more)

    – Top-down approach

    • rules are applied in the same order as in the leftmost derivation
    • “left parse”: the list of production-rule numbers in the leftmost derivation

    – Bottom-up approach

    • rules are applied in the reverse order of the rightmost derivation
    – Note that the bottom-up approach also matches the input string from left to right: the left side of the parse tree is completed first, so in the corresponding derivation the terminals on the right side must be produced first
    • “right parse”: the reversed list of production-rule numbers in the rightmost derivation

  • 18

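    e.g. 1. abac

        1. S → XY    2. X → aX    3. X → b
        4. Y → aY    5. Y → c

    Parse tree (the same tree results from both approaches):

              S
            /   \
           X     Y
          / \   / \
         a   X a   Y
             |     |
             b     c

    top down (leftmost derivation):
        S ⇒ XY ⇒ aXY ⇒ abY ⇒ abaY ⇒ abac
        left parse : 12345

    bottom up (rightmost derivation, applied in reverse):
        S ⇒ XY ⇒ XaY ⇒ Xac ⇒ aXac ⇒ abac
        right parse : 32541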

  • 19

    Syntax Analysis I. Top-Down

  • 20

    Top-down Approach

    • The more intuitive approach (similar to the leftmost derivation process)

    • First,
    – Derive a string by applying the first production rule of the start symbol

    • Then,
    – Compare the derived string with the input string, one character at a time
    – If the next character of the derived string and that of the input string are the same, keep going
    – If the derived string has a nonterminal, derive a new string with the first rule of that nonterminal, and compare the characters of the derived string and the input string (recursively)
    – If they differ? … backtracking!

  • 21

    Backtracking

    • If the compared characters differ
    – Assume that the previously chosen rule was wrong; roll back and apply another candidate rule
    – If the derivation with the new rule also fails, choose yet another rule and repeat the steps above
    – If every candidate fails and no untried rule is left, conclude that the input string is wrong

  • 22

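    e.g. Is the input string accd correct?

        1. S → aAd    2. S → aB
        3. A → b      4. A → c
        5. B → ccd    6. B → ddc

    Try S → aAd with A → b: the derived b does not match the input c (×), so backtrack.
    Try S → aB with B → ccd: the derived string accd matches the input.

         S                S
       / | \             / \
      a  A  d           a   B
         |                  |
         b (×)             ccd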
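    The backtracking procedure can be sketched in C for this example grammar. This is only a minimal illustration, not the lecture's code: the names `RULES`, `derives` and `accepts_backtracking` are made up here, uppercase letters are assumed to be nonterminals, and the approach only terminates because the grammar has no left recursion.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Grammar of the example: uppercase = nonterminal, lowercase = terminal. */
struct rule { char lhs; const char *rhs; };

static const struct rule RULES[] = {
    {'S', "aAd"}, {'S', "aB"},
    {'A', "b"},   {'A', "c"},
    {'B', "ccd"}, {'B', "ddc"},
};
enum { NRULES = sizeof RULES / sizeof RULES[0] };

/* Returns 1 if the sentential form `form` can derive exactly `input`. */
static int derives(const char *form, const char *input)
{
    if (*form == '\0')
        return *input == '\0';
    if (!isupper((unsigned char)*form))          /* terminal: must match */
        return *form == *input && derives(form + 1, input + 1);

    /* Nonterminal: try each of its rules in order; a failed attempt simply
     * returns 0, which is the "rollback" to the next candidate rule. */
    for (int i = 0; i < NRULES; i++) {
        if (RULES[i].lhs != *form)
            continue;
        char next[64];
        int n = snprintf(next, sizeof next, "%s%s", RULES[i].rhs, form + 1);
        if (n < 0 || n >= (int)sizeof next)
            continue;                            /* form too long: give up on it */
        if (derives(next, input))
            return 1;
    }
    return 0;                                    /* no untried rule is left */
}

int accepts_backtracking(const char *input)
{
    return derives("S", input);
}
```

    For the input accd, the sketch first tries S → aAd (failing with both A → b and A → c) and only then succeeds with S → aB, B → ccd.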

  • 23

    LL Parsing

    • Left-to-right scanning
    • Generating a Left parse
    • Deterministic parsing
    – The production rule to apply is selected deterministically by checking the next input character
    – Preparation for LL parsing
    • Analyze the grammar to figure out which production rules can be applied for a given character of the input string
    • If two or more rules are applicable for the same character, LL parsing is impossible
    • No backtracking (the advantage of LL parsing)
    – If the current input character and the generated terminal symbol are not the same, the LL parser concludes that the input string does not conform to the grammar
    – Improves on the execution time of plain top-down approaches

  • 24

    LL Parsing (cont'd)

    • LL parsing
    – confines grammars to well-defined ones

    • Two kinds of technical effort in the LL parsing area
    – Rewriting a grammar to make it suitable for LL parsing
    – Analyzing the grammar to extract as much useful information as possible

  • 25

    LL(1) and LOOKAHEAD(...)

    • LOOKAHEAD
    – The lookahead set for a nonterminal X in a grammar X → A | B | C is the set of terminals with which strings produced from X can begin
    – LOOKAHEAD(A → α) = FIRST({ ω | S ⇒* μAβ ⇒ μαβ ⇒* μω ∈ V_T* }), where V_T is the set of terminal symbols

    • Strong LL(1) condition
    – For any rules A → α | β, the strong LL condition is: LOOKAHEAD(A → α) ∩ LOOKAHEAD(A → β) = ∅
    – In other words, “LOOKAHEAD is unique”

  • 26

    • LL grammar?
    – “a context-free grammar satisfying the LL condition”

    • So,
    – Create an LL grammar from a given arbitrary context-free grammar:
      ambiguity elimination and left-factoring (for general ambiguities)
      + left-recursion elimination (only for LL-related ambiguities)
    – Then, parse with an LL parser

  • 27

    1. Elimination of Ambiguities

    Basic idea: use a different grammar that generates the same language

    • e.g. giving different priorities to different operators

        E → E+E | E*E | id          (ambiguous)

        E → E+T | T
        T → T*F | F
        F → id

    • e.g. applying left associativity
    • Introduce a distinct nonterminal for each priority level
    • The nonterminal that should be processed last is the one closest to the start symbol

        E → T + E                   E → E + T
        E → T                       E → T
        T → num                     T → num

        right-recursive:            left-recursive:
        1 + (2 + (3 + 4))           ((1 + 2) + 3) + 4

  • 28

    2. Left Factoring

    • Factor out the common prefix and introduce a new nonterminal for the differing suffixes

    e.g.
        A → αβ | αγ
        ==>  A → αA′
             A′ → β | γ

    e.g.
        S → iCtS | iCtSeS | a
        C → b
        ==>  S → iCtSS′ | a
             S′ → eS | ε
             C → b
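    After left factoring, a single lookahead character is enough to choose a rule. The following C sketch recognizes the factored grammar S → iCtSS′ | a, S′ → eS | ε, C → b without backtracking. The function names (`parse_S`, `parse_S_tail`, `accepts_if_grammar`) are illustrative assumptions, and S′ is written as `parse_S_tail`. Note that S′ still has two applicable rules when the lookahead is e (the dangling else); the sketch simply prefers S′ → eS.

```c
/* Recognizer for the left-factored grammar
 *   S  -> i C t S S' | a
 *   S' -> e S | epsilon
 *   C  -> b
 */
static const char *pos;              /* next unread input character */

static int parse_S(void);

/* Consume the terminal c if it is the next input character. */
static int eat(char c)
{
    if (*pos == c) { pos++; return 1; }
    return 0;
}

static int parse_C(void) { return eat('b'); }

/* S' -> e S | epsilon: one lookahead character selects the rule. */
static int parse_S_tail(void)
{
    if (*pos == 'e') { pos++; return parse_S(); }
    return 1;                        /* epsilon */
}

static int parse_S(void)
{
    if (*pos == 'i') {               /* S -> i C t S S' */
        pos++;
        return parse_C() && eat('t') && parse_S() && parse_S_tail();
    }
    return eat('a');                 /* S -> a */
}

int accepts_if_grammar(const char *input)
{
    pos = input;
    return parse_S() && *pos == '\0';
}
```

    With the original rules S → iCtS | iCtSeS, the parser could not tell the two alternatives apart until after the inner S had been read, which is exactly what left factoring avoids.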

  • 29

    3. Left-Recursion

    Grammar:  E → E + T
              E → T
              T → num
    Input: “1 + 2 + 3 + 4”

    derived string    lookahead    input (read | unread)
    E                 1            | 1+2+3+4
    E+T               1            | 1+2+3+4
    E+T+T             1            | 1+2+3+4
    E+T+T+T           1            | 1+2+3+4
    T+T+T+T           1            | 1+2+3+4
    1+T+T+T           2            1+ | 2+3+4
    1+2+T+T           3            1+2+ | 3+4
    1+2+3+T           4            1+2+3+ | 4
    1+2+3+4           $            1+2+3+4 |

    Q) Is this all right?

  • 30

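    Left recursion may cause an infinite loop in LL parsing

    Left-recursion elimination: rewriting left recursion into right recursion

        E → E + T                   E → TE′
        E → T               ==>     E′ → +TE′ | ε
        T → num                     T → num

    [Figure: the left-leaning parse tree of the left-recursive grammar next to the right-leaning parse tree of the rewritten grammar]

    In general:
        A → Aα | β    ==>    A → βA′
                             A′ → αA′ | ε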
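    The rewritten grammar E → TE′, E′ → +TE′ | ε, T → num can be parsed by plain recursive procedures, because no procedure calls itself before consuming input. The C sketch below is an illustration under assumptions: the names `parse_E`, `parse_E_tail` and `parse_sum` are invented, num is taken to be an unsigned decimal integer, and whitespace is not skipped. Passing the running value into E′ keeps the sum grouped from the left even though the grammar is right-recursive.

```c
#include <ctype.h>

static const char *cur;      /* next unread input character */
static int ok;               /* cleared when a syntax error is found */

/* T -> num */
static long parse_T(void)
{
    if (!isdigit((unsigned char)*cur)) { ok = 0; return 0; }
    long v = 0;
    while (isdigit((unsigned char)*cur))
        v = v * 10 + (*cur++ - '0');
    return v;
}

/* E' -> + T E' | epsilon; `left` is the value accumulated so far. */
static long parse_E_tail(long left)
{
    if (ok && *cur == '+') {
        cur++;
        long right = parse_T();
        return parse_E_tail(left + right);
    }
    return left;             /* epsilon */
}

/* E -> T E' */
static long parse_E(void)
{
    long t = parse_T();
    return parse_E_tail(t);
}

/* Returns 1 and stores the value in *out if `input` is a valid E. */
int parse_sum(const char *input, long *out)
{
    cur = input;
    ok = 1;
    long v = parse_E();
    if (!ok || *cur != '\0')
        return 0;
    *out = v;
    return 1;
}
```

    A procedure written directly for E → E + T would call itself first, without reading any input, which is the infinite loop mentioned above.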

  • 31

    Implementation of LL(1) Parsers

    1. Recursive descent parser

    – Using recursion

    – For each non-terminal, write one procedure

    – Adv. : intuitive, easy
    – Disadv. : any modification to the production rules entails a modification to the parser

    2. Predictive parser
    – theoretically based on the PDA (pushdown automaton)
    – no need to change the parser even when the production rules change
    • Only the parsing table needs to be changed

  • 32

    1. Recursive Descent Parser Example

    • For each terminal symbol a,

        procedure pa;
        begin
            if nextSymbol = ta then get_nextSymbol else error
        end; /* pa */

    • For each nonterminal symbol A,

        procedure pA;
        begin
            case nextSymbol of
                LOOKAHEAD(A → X1X2...Xm): for i := 1 to m do pXi;
                LOOKAHEAD(A → Y1Y2...Yn): for i := 1 to n do pYi;
                LOOKAHEAD(A → Z1Z2...Zr): for i := 1 to r do pZi;
                LOOKAHEAD(A → ε): ;
                otherwise: error
            end
        end; /* pA */

  • 33

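    1. Recursive Descent Parser Example

    Grammar: S → aAb
             A → aS | b

    (1) Terminal symbols

    /* Match terminal a: consume it or report an error. */
    void pa() {
        if (nextSymbol == ta)
            nextSymbol = get_nextSymbol();
        else
            error();
    }

    /* Match terminal b: consume it or report an error. */
    void pb() {
        if (nextSymbol == tb)
            nextSymbol = get_nextSymbol();
        else
            error();
    }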

  • 34

    1. Recursive Descent Parser Example

    Grammar: S → aAb
             A → aS | b

    (2) Nonterminal symbols

    /* S → aAb */
    void pS() {
        if (nextSymbol == ta) {
            pa(); pA(); pb();
        } else {
            error();
        }
    }

    /* A → aS | b: the lookahead selects the rule */
    void pA() {
        switch (nextSymbol) {
        case ta: pa(); pS(); break;
        case tb: pb(); break;
        default: error();
        }
    }

  • 35

    Example: ANTLR

    • ANTLR (ANother Tool for Language Recognition)
    – http://www.antlr.org/
    – Parser/lexer generator: takes a grammar and generates an LL(k) lexer and/or parser
    – Written in Java, open source software
    – The successor to the Purdue Compiler Construction Tool Set (PCCTS)
    – Can generate Java, C#, C++, Python, …


  • JavaGrammar.g in ANTLR

  • 40

    2. Predictive Parser

    • Build a Table!

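    – In LL(1), the rule to apply is uniquely determined by the given lookahead
    – Thus, a Nonterminal × Lookahead table enables mechanical parsing
    – e.g.  1. S → aS    2. S → bA
            3. A → d     4. A → ccA

              a    b    c    d
        S     1    2
        A               4    3

    Parse tree for the input aabccd (the rules are applied in the order 1, 1, 2, 4, 3):

        S
        ├─ a
        └─ S
           ├─ a
           └─ S
              ├─ b
              └─ A
                 ├─ c
                 ├─ c
                 └─ A
                    └─ d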
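    The table can drive a parser mechanically with an explicit stack. The C sketch below hard-codes the example grammar and table; the names `RHS`, `table_lookup` and `ll1_parse` are illustrative assumptions, and the end of the input string plays the role of the end marker.

```c
#include <string.h>

/* Right-hand sides, numbered as in the example (index 0 is unused). */
static const char *const RHS[] = { "", "aS", "bA", "d", "ccA" };

/* Parsing table: rule number for (nonterminal, lookahead); 0 means error. */
static int table_lookup(char nonterminal, char lookahead)
{
    switch (nonterminal) {
    case 'S': return lookahead == 'a' ? 1 : lookahead == 'b' ? 2 : 0;
    case 'A': return lookahead == 'c' ? 4 : lookahead == 'd' ? 3 : 0;
    default:  return 0;
    }
}

/*
 * Parses `input`. On success, returns how many rules were applied and
 * stores their numbers (the left parse) in `parse`; returns -1 on error.
 */
int ll1_parse(const char *input, int *parse, int max_rules)
{
    char stack[128];
    int top = 0, applied = 0;
    stack[top++] = 'S';                            /* start symbol */

    while (top > 0) {
        char x = stack[--top];
        if (x == 'S' || x == 'A') {                /* nonterminal: consult table */
            int r = table_lookup(x, *input);
            if (r == 0 || applied >= max_rules)
                return -1;
            parse[applied++] = r;
            size_t len = strlen(RHS[r]);
            if ((size_t)top + len > sizeof stack)
                return -1;
            for (size_t i = len; i > 0; i--)       /* push the rhs reversed */
                stack[top++] = RHS[r][i - 1];
        } else {                                   /* terminal: must match input */
            if (*input != x)
                return -1;
            input++;
        }
    }
    return *input == '\0' ? applied : -1;
}
```

    Changing the grammar only requires changing `RHS` and `table_lookup`; the loop in `ll1_parse` stays the same.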

  • 41

    Syntax Analysis II. Bottom-Up

  • 42

    Bottom-Up Parsing

    • Bottom-up = right parse = reverse order of the rightmost derivation
    – starting from the terminal symbols, generate the start symbol
    – check whether part of the current string matches the rhs of a production rule
    • if it matches, replace it with the lhs

    Grammar:
        E → E + T | T
        T → num | (E)

    Reduction of (1+2+(3+4))+5, one step per line:
        (1+2+(3+4))+5
        (T+2+(3+4))+5
        (E+2+(3+4))+5
        (E+T+(3+4))+5
        (E+(3+4))+5
        (E+(T+4))+5
        (E+(E+4))+5
        (E+(E+T))+5
        (E+(E))+5
        (E+T)+5
        (E)+5
        T+5
        E+5
        E+T
        E
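    For this particular grammar, the reduction sequence can be reproduced with a very small shift-reduce loop. The C sketch below is a simplification that works only because the grammar has a single operator, so a handle on top of the stack can always be reduced immediately without consulting a table. The names `reduce_once` and `shift_reduce` are invented, and num is assumed to be a single digit.

```c
#include <ctype.h>
#include <string.h>

/* Try one reduction on top of the stack; returns 1 if one was applied. */
static int reduce_once(char *stack, int *top)
{
    int n = *top;
    if (n >= 1 && isdigit((unsigned char)stack[n - 1])) {   /* T -> num */
        stack[n - 1] = 'T';
        return 1;
    }
    if (n >= 3 && memcmp(stack + n - 3, "(E)", 3) == 0) {    /* T -> (E) */
        stack[n - 3] = 'T';
        *top = n - 2;
        return 1;
    }
    if (n >= 3 && memcmp(stack + n - 3, "E+T", 3) == 0) {    /* E -> E+T */
        *top = n - 2;                /* the E is already in place */
        return 1;
    }
    if (n >= 1 && stack[n - 1] == 'T') {                     /* E -> T */
        stack[n - 1] = 'E';
        return 1;
    }
    return 0;
}

/* Returns the number of reductions performed, or -1 if the input is rejected. */
int shift_reduce(const char *input)
{
    char stack[256];
    int top = 0, reductions = 0;

    for (; *input != '\0'; input++) {
        /* Only digits, '+', '(' and ')' are valid tokens. */
        if (!isdigit((unsigned char)*input) && strchr("+()", *input) == NULL)
            return -1;
        if (top >= (int)sizeof stack)
            return -1;
        stack[top++] = *input;                  /* shift */
        while (reduce_once(stack, &top))        /* reduce while a handle is on top */
            reductions++;
    }
    return (top == 1 && stack[0] == 'E') ? reductions : -1;
}
```

    The check for E+T comes before the check for a lone T, so a T that completes E + T is never reduced to E on its own. Grammars with several operators need more information to make this choice, which is where conflicts and parser states come in.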

  • 43

    Top-down vs. Bottom-up

    [Figure: partially built parse trees over the scanned and unscanned parts of the input, for top-down and for bottom-up parsing]

    • Bottom-up parsing is more powerful
    • The selection of production rules can be put off until more tokens have been read
      e.g. a left-recursive grammar can be parsed by a bottom-up parser

  • 44

    Terms : LL, LR

    • LL(k)
    – scanning the input Left-to-right
    – Leftmost derivation
    – looking ahead k symbols
    – also called top-down or predictive parsing (LL parser)
    – traversing and creating the parse tree in pre-order

    • LR(k)
    – scanning the input Left-to-right
    – Rightmost derivation (in reverse)
    – looking ahead k symbols
    – also called bottom-up or shift-reduce parsing (LR parser)
    – traversing and creating the parse tree in post-order

  • 45

    What is Reduce?

    • When S ⇒* αβω and there exists a production rule of the form A → β, a reduction is the substitution of β with A in αβω

    • Parsing is done by reducing an input string until we get the start symbol

    • e.g.
        1. S → aAcBe    2. A → Ab
        3. A → b        4. B → d

    (1) reduce sequence
        abbcde
        aAbcde   (reduce 3)
        aAcde    (reduce 2)
        aAcBe    (reduce 4)
        S        (reduce 1)

    (2) Parse tree
        S
        ├─ a
        ├─ A
        │  ├─ A
        │  │  └─ b
        │  └─ b
        ├─ c
        ├─ B
        │  └─ d
        └─ e

  • 46

    Handle

    • The portion of a string that is to be reduced
    – If there exists a rightmost derivation S ⇒* αAω ⇒ αβω, then β is said to be a handle of αβω

    e.g.  1. E → E + T    2. E → T        3. T → T * F
          4. T → F        5. F → ( E )    6. F → a

    Each line lists, in parentheses, the rule numbers applied so far. In the reduce column, the handle is the substring replaced at each step.

    Derivation                          Reduce
    E ⇒ E + T         (1)               a + a * a ⇒ F + a * a    (6)
      ⇒ E + T * F     (13)                        ⇒ T + a * a    (64)
      ⇒ E + T * a     (136)                       ⇒ E + a * a    (642)
      ⇒ E + F * a     (1364)                      ⇒ E + F * a    (6426)
      ⇒ E + a * a     (13646)                     ⇒ E + T * a    (64264)
      ⇒ T + a * a     (136462)                    ⇒ E + T * F    (642646)
      ⇒ F + a * a     (1364624)                   ⇒ E + T        (6426463)
      ⇒ a + a * a     (13646246)                  ⇒ E            (64264631)

  • 47

    Implementation of Parsers

    • Shift-reduce parsing

    – Using a stack and a special table

    – Operations

    • “Shift” : moving input symbols to the stack until any handle appears on

    top of the stack

    • “Reduce” : for a given handle, determining a production rule and

    replacing the handle with the lhs of the rule

    • Repeat these operations until only the start symbol is left on the stack

  • 48

    Actions in Shift-Reduce Parsing

    • Shift: move the look-ahead token onto the stack
    – push the token a

    • Reduce: replace the handle β on top of the stack with the nonterminal symbol X (when the production rule is X → β)
    – pop β, push X

        stack      input            action
        (          1+2+(3+4))+5     shift 1
        (1         +2+(3+4))+5

        stack      input            action
        (E+T       +(3+4))+5        reduce E → E + T
        (E         +(3+4))+5

  • 49

    Eg. Shift-Reduce Parsing

    1. E → E + T    2. E → T        3. T → T * F
    4. T → F        5. F → ( E )    6. F → a

    (1) reduce process (rules applied so far in parentheses):
        a + a * a
        F + a * a    (6)
        T + a * a    (64)
        E + a * a    (642)
        E + F * a    (6426)
        E + T * a    (64264)
        E + T * F    (642646)
        E + T        (6426463)
        E            (64264631)

    reduce sequence : 64264631

  • 50

    Multiple Candidate Handles

    • What if there are two or more candidate handles? ..... “ambiguity, conflict ..”

    e.g.  E → E + E | E * E | ( E ) | id
          input: id + id * id

        E => E + E                E => E * E
          => E + E * E              => E * id
          => E + E * id             => E + E * id
          => E + id * id            => E + id * id
          => id + id * id           => id + id * id

  • 51

    Conflict Resolution

    • By defining priority

    – reduce/reduce : compare the priorities of the production rules and choose the one with the higher priority
    – shift/reduce : compare the priority of the production rule to reduce with that of the input token
        reduce : the rule has higher priority than the input token
        shift : the input token has higher priority than the rule

    e.g.  E → E + E | E * E | num | (E)

        Priorities (to process the * operation first)
            1: E → E * E, *
            2: E → E + E, +

        Input 1 + 2 * 3: a shift/reduce conflict arises when E + E is on the stack and * is the input token; * has the higher priority, so the parser shifts.
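    The priority rule can be sketched as a small shift-reduce evaluator in C. This is an illustration under simplifying assumptions, not a parser generator's mechanism: parentheses are left out, values and operators are kept on two separate stacks, and the names `priority`, `reduce_top` and `eval_with_priorities` are invented. When the rule on the stack and the input token have equal priority, the sketch reduces, which corresponds to left associativity.

```c
#include <ctype.h>

/* Higher number = higher priority; 0 = not an operator. */
static int priority(char op)
{
    return op == '*' ? 2 : op == '+' ? 1 : 0;
}

/* Reduce "E op E" on top of the stacks to a single E. */
static void reduce_top(long *vals, int *nv, char *ops, int *no)
{
    long r = vals[--*nv];
    long l = vals[--*nv];
    char op = ops[--*no];
    vals[(*nv)++] = (op == '*') ? l * r : l + r;
}

/* Evaluates e.g. "1+2*3"; returns 1 and stores the result on success. */
int eval_with_priorities(const char *s, long *out)
{
    long vals[64];
    char ops[64];
    int nv = 0, no = 0, expect_num = 1;

    while (*s != '\0') {
        if (expect_num) {
            if (!isdigit((unsigned char)*s) || nv >= 64)
                return 0;
            long v = 0;
            while (isdigit((unsigned char)*s))
                v = v * 10 + (*s++ - '0');
            vals[nv++] = v;                 /* shift num, reduce E -> num */
            expect_num = 0;
        } else {
            if (priority(*s) == 0)
                return 0;
            /* Shift/reduce decision: reduce while the rule on the stack has
             * priority at least as high as the input token, otherwise shift. */
            while (no > 0 && priority(ops[no - 1]) >= priority(*s))
                reduce_top(vals, &nv, ops, &no);
            ops[no++] = *s++;               /* shift the operator */
            expect_num = 1;
        }
    }
    if (expect_num)
        return 0;                           /* empty input or trailing operator */
    while (no > 0)
        reduce_top(vals, &nv, ops, &no);
    *out = vals[0];
    return 1;
}
```

    For 1+2*3 the evaluator shifts * instead of reducing 1+2, giving 7; for 2*3+4 it reduces 2*3 before shifting +, giving 10.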

  • 52

    Conflict Resolution by Defining Priority (cont'd): Enforcing Associativity by Priorities

    – Left associativity: reduce first
    • the rule has higher priority than the input token
    – Right associativity: shift first
    • the input token has higher priority than the rule

    e.g.  E → E + E
          E → num

        Input: 1 + 2 + 3
            shift:  1 + (2 + 3)
            reduce: (1 + 2) + 3

        Priorities (for left associativity)
            1: E → E + E
            2: +

    Is there a more convenient method? Yes, in parser generation tools.

  • 53

    Conflict Resolution (cont'd)

    • But how are conflicts resolved in general?
    – When the stack holds α and the input token is b, and we have the rule X → γ where α = βγ
    • Should we reduce by X → γ, leaving βX on the stack?
    • Or push b, leaving αb on the stack?
    – When the stack holds α with α = βγ and α = β′γ′, and we have both X → γ and X′ → γ′
    • Which one should be the next handle, γ or γ′?

    • Solution
    – Using “parser states”: guiding information for selecting actions and handles
    – Far fewer conflicts

  • 54

    LR Parser

    • LR(k)
    – Left-to-right scan, rightmost derivation, k lookahead symbols
    – Basics: LR(0), LR(1) ...
    – Variations: SLR, ... LALR(1)

    • Example
    – Yacc (recall “lex & yacc”) … we will see it in the next chapter

  • 55

    Categories of Syntax Analysis Methods

    [Figure: nested classes of grammars, with LR(0) inside SLR inside LALR(1) inside LR(1), and LL(1) shown alongside]

    LR(k) ⊂ LR(k+1)
    LL(k) ⊂ LL(k+1)
    LL(k) ⊂ LR(k)
    LR(0) ⊂ SLR
    LALR(1) ⊂ LR(1)