Scanning Outline - University of Manitoba (maheswar/cs329/lectures/lexical.pdf)

Scanning Outline

◆ 2. Scanning
  ➨ The basics
  ➨ Ad-hoc scanning
  ➨ FSM based techniques
  ➨ A Lexical Analysis tool - Lex (a scanner generator)

Recall: Compiler Structure

[Figure: compiler pipeline. Source Code → Lexical Analysis/Scanning → Syntax Analysis/Parsing → Semantic Analysis → Machine Independent Optimization → Code Generation → Machine Dependent Optimization → Machine Code. The phases are grouped into a Front End and a Back End.]

Compiler Structure - Another View

[Figure: Input Language → Lexical Analyzer → Lexemes or Tokens → Syntax Analyzer → Phrase Structure → Code Generator → Output Language]

Lexical Analysis - What is it?

◆ The input to a compiler/interpreter is a source program which is “structured” as a sequence/stream of characters
  ➨ or rather, unstructured
◆ Processing individual characters is pretty tedious and highly inefficient
◆ As such, the first thing we have to do is add some basic “structure” to the source code

Lexical Analysis - What is it?

◆ A Lexical Analyzer (a.k.a. scanner) converts a stream of characters into a stream of tokens
  ➨ i.e. it “tokenizes” the input
◆ This is a many:1 transformation and thus later phases of compilation will only need to deal with comparatively few tokens.
◆ A token (a.k.a. lexeme or syntactic unit) is a fundamental component of a program

Lexical Analysis - Tokens

◆ Tokens are typically the bottom level entities in syntax diagrams
◆ Typical tokens include:
  ➨ identifiers (e.g. variable names, etc.)
  ➨ keywords
  ➨ operators
  ➨ literals (i.e. constant values)
  ➨ punctuation
◆ Consider a simple program and its tokens:

Lexical Analysis - Tokens

Source Code:

PROGRAM test cr/lf
VAR x : INTEGER ; cr/lf
BEGIN cr/lf
x := x + 1 ; cr/lf
END . { test }

Tokens:

keyword(PROGRAM) ident(test) keyword(VAR) ident(x) punct(:)
keyword(INTEGER) punct(;) keyword(BEGIN) ident(x)
operator(:=) ident(x) operator(+) literal(1) punct(;)
keyword(END) punct(.)
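A character stream for a small Pascal-like program can be turned into such a token stream with a few lines of code. The following Python sketch is only an illustration: the keyword set, the category names, and the helper `tokenize` are assumptions made here, not part of the course material.

```python
import re

# Hypothetical token classes for a tiny Pascal-like program; the keyword
# set and the category names are assumptions made for this sketch.
KEYWORDS = {"PROGRAM", "VAR", "INTEGER", "BEGIN", "END"}

TOKEN_SPEC = [
    ("comment",  r"\{[^}]*\}"),            # { ... } is treated as white space
    ("space",    r"\s+"),
    ("word",     r"[A-Za-z][A-Za-z0-9]*"),  # keyword or identifier
    ("literal",  r"[0-9]+"),
    ("operator", r":=|\+"),
    ("punct",    r"[:;.]"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(source):
    """Return a list of (kind, text) pairs, skipping white space and comments."""
    tokens = []
    pos = 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            raise ValueError(f"unexpected character {source[pos]!r} at {pos}")
        kind, text = m.lastgroup, m.group()
        pos = m.end()
        if kind in ("space", "comment"):
            continue                        # white space never reaches the parser
        if kind == "word":
            kind = "keyword" if text in KEYWORDS else "ident"
        tokens.append((kind, text))
    return tokens

SOURCE = "PROGRAM test\nVAR x : INTEGER ;\nBEGIN\nx := x + 1 ;\nEND . { test }"
for kind, text in tokenize(SOURCE):
    print(f"{kind}({text})")
```

Listing `:=` in the operator pattern ahead of the single-character punctuation pattern is what keeps `:=` from being split into `:` and `=`.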

Other Scanner Functions

◆ A scanner also removes white space from a program
◆ White space consists of spaces, tabs, carriage returns, comments, and the like
  ➨ stuff put into the source code solely for readability which does not affect the functional specification provided by the program
◆ Some scanners also enter symbols in the symbol table (more later)

Ad hoc Scanning

◆ There are many applications outside of compiler construction that require simple scanning functions
  ➨ e.g. recognizing numeric values in financial and other applications
◆ These applications either implement their own recognition functions or rely on library routines or language based pattern matching to provide the needed functionality

Ad hoc Scanning

◆ Manual recognition of tokens involves a multitude of IF, WHILE, and SWITCH statements
  ➨ This approach is ugly, extremely tedious, highly error prone, and difficult to understand, maintain, and extend
◆ Using existing routines for doing pattern matching is a significant improvement

Ad hoc Scanning

◆ In many cases (e.g. C language) these facilities are provided by library routines
  ➨ #include <string.h>: index, strlen, strcat, etc.
◆ In other cases (e.g. some variants of Pascal) they are incorporated into the language
  ➨ substring functions, sets, etc.
  ➨ or consider the language Perl!!!
◆ Both of these reflect the prevalence and importance of such functionality

Ad hoc Scanning

◆ Anyone who has had to do a significant amount of such scanning/pattern matching knows how awkward it is
  ➨ e.g. consider data verification or the processing of command line arguments as other examples
◆ Scanning in a compiler/interpreter is typically far worse
  ➨ Even simple languages have complex lexemes

Grammars

◆ A “generative grammar” is a set of rules to generate valid phrases in a particular language
◆ Grammar G = {V, T, P, S}
  ➨ V - finite set of non-terminals or variables
  ➨ T - finite set of terminals or tokens
  ➨ P - finite set of productions
  ➨ S - a non-terminal called the “start symbol”
◆ Noam Chomsky defined classes of “complexity” of generative grammars
◆ The hierarchy of four classes, each of which properly contains the next, is called the Chomsky hierarchy

Grammars - Chomsky hierarchy

Type 0: Unrestricted Grammars
Type 1: Context-Sensitive Grammars (CSGs)
Type 2: Context-Free Grammars (CFGs)
Type 3: Regular Grammars (RGs)

Unrestricted Grammars

◆ This type of grammar is too complex for programming languages -- efficient parsers cannot be constructed for it
◆ This grammar consists of productions of the form α → β

Context-Sensitive Grammars

◆ Most computer languages fall into this class of grammars
◆ The productions in this class are of the form α1 A α2 → α1 β α2
◆ “A becomes β in the context of α1 and α2” -- in general these grammars are still too complex for efficient computer analysis
◆ The context sensitivity of programming languages is handled by other means so that context free grammars can be used for programming languages

Context Free Grammars

◆ A production of a context free grammar (CFG) is of the form A → α, where A is a variable and α is a string of symbols
◆ In CFGs, derivations on variables are independent of what surrounds them
◆ To generate phrases in the language, strings of terminals are derived by repeated expansion of non-terminals
◆ CFGs permit the construction of efficient syntax analyzers

Context Free Grammars

◆ Example (productions of the grammar):

<S> → a <A> b
<A> → <B> c
<B> → d

◆ The language generated by the above grammar consists of the single string adcb
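The repeated expansion of non-terminals can be traced mechanically. The Python sketch below expands the leftmost non-terminal of the three-production example until only terminals remain; the dictionary layout and the function name `leftmost_derivation` are assumptions made for illustration, and the loop relies on each non-terminal having exactly one production.

```python
# Productions of the example grammar; non-terminals are written without
# angle brackets. The dict layout is an assumption made for this sketch.
PRODUCTIONS = {
    "S": ["a", "A", "b"],
    "A": ["B", "c"],
    "B": ["d"],
}

def leftmost_derivation(start="S"):
    """Yield each sentential form while expanding the leftmost non-terminal."""
    form = [start]
    yield " ".join(form)
    while any(sym in PRODUCTIONS for sym in form):
        # Index of the leftmost symbol that still has a production.
        i = next(k for k, sym in enumerate(form) if sym in PRODUCTIONS)
        form[i:i + 1] = PRODUCTIONS[form[i]]
        yield " ".join(form)

steps = list(leftmost_derivation())
print(" => ".join(steps))   # S => a A b => a B c b => a d c b
```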

Regular Grammars

◆ If all the productions of a CFG are of the form
  ➨ A → ωB or A → ω, where A, B are non-terminals and ω is a string of terminals (possibly empty)
  or all of them are of the form
  ➨ A → Bω or A → ω, where A, B are non-terminals and ω is a string of terminals (possibly empty)
◆ Then the grammar is a RG -- the first form is called “Right linear” and the second form is called “Left linear”
◆ RGs are too restrictive for most purposes
◆ Very efficient parsers can be built for RGs

Regular Grammars

◆ The reason for the efficiency is that the language generation from RG can be performed without remembering our current position in the production that is currently being expanded
◆ Lack of memory makes RGs incapable of generating languages with arbitrarily nested structures
◆ In compilers, RGs will be used to describe “words” and CFGs will be used to describe phrases constructed from these words

Regular Expressions - REs

◆ Regular expressions are a simplified form of grammar used to represent RGs
  ➨ ε (epsilon - the empty string) is a regular expression that matches the empty string
  ➨ symbol (terminal) s in the language is a RE that matches s
  ➨ if R is a RE, (R)* matches zero or more occurrences of the pattern R - known as the closure of R
  ➨ if R is a RE, (R)+ matches one or more occurrences of the pattern R

Regular Expressions - REs

  ➨ If R and S are REs, (R)|(S) matches either the pattern R or the pattern S -- alternation
  ➨ If R and S are REs, (R)(S) matches the catenation of pattern R followed by pattern S
◆ Example
◆ <int> ::= (0|1|2|3|4|5|6|7|8|9)+
◆ <int_no_leading_zero> ::= (1|2|3|4|5|6|7|8|9) (0|1|2|3|4|5|6|7|8|9)*
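The two integer definitions can be tried out with Python's `re` module. This is a sketch only: the digit alternation is abbreviated to a character class, and the names `is_int` and `is_int_no_leading_zero` are made up for the illustration.

```python
import re

# Direct transcriptions of the two definitions, with the digit
# alternation abbreviated to a character class.
INT = re.compile(r"[0-9]+")
INT_NO_LEADING_ZERO = re.compile(r"[1-9][0-9]*")

def is_int(s):
    """True if the whole string matches <int>."""
    return INT.fullmatch(s) is not None

def is_int_no_leading_zero(s):
    """True if the whole string matches <int_no_leading_zero>."""
    return INT_NO_LEADING_ZERO.fullmatch(s) is not None

for s in ["0", "007", "42", "", "4a"]:
    print(repr(s), is_int(s), is_int_no_leading_zero(s))
```

Note that `<int_no_leading_zero>` as written rejects the single digit 0 as well, since its first symbol must be in 1..9.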

Better Scanning Techniques

◆ The awkwardness of ad hoc scanning has motivated the development of both techniques and tools for doing scanning
◆ The most common of these are based on what are known as finite state machines (FSMs) which recognize regular languages
◆ The key to being able to do this is the existence of certain restrictions placed on the format of programming languages
  ➨ E.g. tokens are usually separated by delimiters

FSM-based Scanning

◆ The most common techniques used for building scanners are based on finite state machines (or FSMs)
◆ FSMs can be easily used to recognize language constructs (tokens) which are described by regular languages

Regular Languages - Revisited

◆ A regular language is one which can be described by regular expressions
◆ A regular expression consists of simple, atomic elements combined using only three operations
  ➨ catenation,
  ➨ alternation, and
  ➨ repetition


◆ Catenation (a.k.a. concatenation or sequencing) is represented by physical adjacency
  ➨ e.g. the regular expression ‘<letter> <digit>’ simply represents (depending on the definition of letter and digit) a sequence composed of a letter followed by a digit
    ● we would use the “::=” (equivalence) operator to associate a definition with <letter> or <digit>

◆ Alternation allows selection from a number of choices and is commonly represented by the ‘|’ operator
  ➨ E.g. <digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
◆ Certain shorthand forms are also commonly used with alternation (especially ellipses)
  ➨ E.g. <alpha> ::= a | b | … | z | A | B | … | Z


◆ Finally, repetition permits the expression of constructs which are to be repeated some number of times
◆ There are two operators used for this purpose: superscript ‘+’, and superscript ‘*’
  ➨ E.g. <word> ::= <letter>+
    ● this implies 1 or more letters (* would imply 0 or more letters)

◆ Finally, parentheses (‘(’ and ‘)’) are used for grouping regular expressions
◆ Normally, the repetition operators have the highest precedence, followed by catenation and then by alternation
◆ These 3 simple operations permit us to easily express the tokens that occur in existing programming languages


◆ Consider the following regular expressions for a few common tokens and token types we might encounter
  ➨ <assignop> ::= ‘:=’
  ➨ <alphanum> ::= <alpha> | <digit>
  ➨ <ident> ::= (<alpha> | ‘_’ | ‘$’) <alphanum>*
  ➨ <intconst> ::= <digit>+
  (Note the use of quotes)
◆ Not everything is this simple to specify

◆ For this reason, there are a couple of other “short cuts” that make life easier
◆ These are notational conveniences only and can easily be represented using the basic constructs
◆ Logical Negation (‘^’ or ‘~’)
  ➨ commonly used with other constructs
  ➨ <comment> ::= ‘{’ (~‘}’)* ‘}’


  ➨ ~a implies anything in U-{a}, where U is the set of all input symbols
  ➨ Negation can be done simply by enumerating everything in U-{a}
    ● e.g. if U={a b c d e} then we could write (~a)* or, alternatively, (b | c | d | e)*
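That negation is only a notational convenience can be checked by brute force on the five-symbol universe. In the sketch below, the helper `negate` (a name invented here) builds the enumerated alternation for U-{a}, and the loop confirms that the resulting pattern accepts exactly the short strings over U that contain no a.

```python
import re
from itertools import product

U = "abcde"                       # the universe from the example

def negate(symbol, universe=U):
    """Build the alternation that enumerates universe - {symbol}."""
    rest = [re.escape(c) for c in universe if c != symbol]
    return "(?:" + "|".join(rest) + ")"

not_a_star = re.compile(negate("a") + "*")     # i.e. (b|c|d|e)*

def matches(s):
    return not_a_star.fullmatch(s) is not None

# Every string over U of length 0..3 is accepted exactly when it has no 'a'.
for n in range(4):
    for chars in product(U, repeat=n):
        s = "".join(chars)
        assert matches(s) == ("a" not in s)

print(negate("a"))    # (?:b|c|d|e)
```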

◆ Optional Constructs
  ➨ sometimes it becomes tedious to list a number of similar options which could be more conveniently expressed by saying some constructs are optional

  ➨ The most common notation for an optional construct is the use of square brackets
    ● E.g. <signedintconst> ::= [‘+’ | ‘-’] <intconst>
  ➨ The preceding example is equivalent to the following:
    ● <signedintconst> ::= <intconst> | ‘+’ <intconst> | ‘-’ <intconst>
  ➨ If we could specify the number of times a repetition could take place we could do it another way too


◆ Consider:
  ➨ <signedintconst> ::= (‘+’ | ‘-’)^{0..1} <intconst>
    ● The superscript 0..1 is intended to imply that repetition can take place at most once (0 or 1 times)
◆ This illustrates yet another possible construct which, like the others, may be expressed using only catenation, alternation, and repetition
  ➨ albeit more verbosely
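The three ways of writing a signed integer constant can be compared directly. The sketch below encodes each one as a Python regular expression (the variable names and the `agree` helper are assumptions for illustration) and checks that all three give the same verdict on each input.

```python
import re

INTCONST = r"[0-9]+"

# Optional-construct form: [+ | -] <intconst>
with_optional = re.compile(r"[+-]?" + INTCONST)
# Plain alternation: <intconst> | + <intconst> | - <intconst>
with_alternation = re.compile(rf"{INTCONST}|\+{INTCONST}|-{INTCONST}")
# Bounded repetition: (+ | -) repeated 0..1 times, then <intconst>
with_count = re.compile(r"(?:\+|-){0,1}" + INTCONST)

def agree(s):
    """Return the common verdict of the three patterns; fail if they differ."""
    results = {p.fullmatch(s) is not None
               for p in (with_optional, with_alternation, with_count)}
    assert len(results) == 1, s
    return results.pop()

for s in ["7", "+7", "-42", "+-7", "+", ""]:
    print(repr(s), agree(s))
```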

◆ Let’s try something a bit more challenging:
◆ What does a real constant look like?
  ➨ It might have a sign for the mantissa
  ➨ The mantissa consists of some digits followed by a decimal point possibly followed by some more digits (the fractional part)
  ➨ There might be an exponent as well which could be signed


◆ Let’s do this in pieces...
  ➨ <realconst> ::= <mantissa> [ ‘E’ <exponent> ]
◆ Consider the exponent first - it’s just a signed integer constant:
  ➨ <exponent> ::= [‘+’ | ‘-’] <intconst>
◆ where
  ➨ <intconst> ::= <digit>+
  ➨ <digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

◆ Now let’s try the mantissa…
  ➨ <mantissa> ::= [‘+’ | ‘-’] <intconst> ‘.’ [ <intconst> ]
◆ As with programming, divide and conquer works well to handle the complexity of regular expression specification
◆ Also, the use of the “optional” constructs greatly simplifies this specification
  ➨ As an exercise, try doing the real constant without ‘[’ and ‘]’
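The divide-and-conquer construction carries over directly to code: each named piece becomes a string and the pieces are pasted together. This Python sketch mirrors the definitions of `<digit>`, `<intconst>`, `<exponent>`, `<mantissa>`, and `<realconst>`; the function name `is_realconst` is invented for the example.

```python
import re

# Each definition is built from the ones before it.
DIGIT = r"[0-9]"
INTCONST = rf"{DIGIT}+"
EXPONENT = rf"[+-]?{INTCONST}"
MANTISSA = rf"[+-]?{INTCONST}\.(?:{INTCONST})?"
REALCONST = re.compile(rf"{MANTISSA}(?:E{EXPONENT})?")

def is_realconst(s):
    """True if the whole string is a <realconst> as defined above."""
    return REALCONST.fullmatch(s) is not None

for s in ["3.14", "-2.", "6.02E+23", "1E5", ".5", "1.5e3"]:
    print(repr(s), is_realconst(s))
```

Under these definitions the decimal point is mandatory and at least one digit must precede it, so `1E5` and `.5` are rejected; only an upper-case `E` introduces the exponent.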

Regular Languages and FSMs

◆ A good way to start developing a scanner is to produce regular expressions for the tokens you wish to recognize
◆ The regular expressions themselves, however, are not the basis of the scanning process
◆ The scanning process itself requires a Finite State Machine (FSM) specification

Finite State Machines

◆ Fortunately, there is a direct mapping from regular expressions to the FSMs that “implement” them
◆ An FSM is an abstract machine which can be in one of a finite number of states, which makes state transitions based on inputs, and which performs specific actions in specific states or on transitions between states
  ➨ cf. Moore and Mealy machines from digital logic


◆ FSMs are commonly represented graphically
  ➨ Nodes in the graph represent individual states and are assigned meaningful names
  ➨ Edges represent transitions between the states and are labeled with the input values which cause the state transitions
◆ An FSM-based scanner takes its input from the source code character stream

◆ The FSM-based scanner performs certain actions which include recognizing specific characters, accumulating the characters in a particular token, and returning completed tokens to form the output token stream
◆ We’ll begin by just recognizing some simple tokens and worry about actually building the tokens later

◆ <digit> ::= 0 | 1 | … | 9
◆ <intconst> ::= <digit>+

[Figure: FSM for <intconst>. State intconst is entered on 0..9, loops on 0..9, and is left on other.]

◆ <ident> ::= (<alpha> | ‘_’ | ‘$’) <alphanum>*

[Figure: FSM for <ident>. State ident is entered on alpha, _, $, loops on alphanum, and is left on other.]
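Both machines have the same shape: one transition out of the start state, a loop on the accepting state, and an exit on any other character. The Python sketch below captures that shape in a single helper; the names `run_machine`, `scan_intconst`, and `scan_ident` are assumptions made for illustration.

```python
import string

ALPHA = set(string.ascii_letters)
DIGITS = set(string.digits)

def run_machine(first, loop, text, pos=0):
    """Two-state recognizer: start --first--> accept, accept --loop--> accept.

    Returns the longest lexeme starting at pos, or None if the first
    character does not leave the start state.
    """
    state, begin = "start", pos
    while pos < len(text):
        ch = text[pos]
        if state == "start" and ch in first:
            state = "accept"
        elif state == "accept" and ch in loop:
            pass                       # stay in the accepting state
        else:
            break                      # "other": leave ch unconsumed
        pos += 1
    return text[begin:pos] if state == "accept" else None

def scan_intconst(text, pos=0):
    # <intconst> ::= <digit>+
    return run_machine(DIGITS, DIGITS, text, pos)

def scan_ident(text, pos=0):
    # <ident> ::= (<alpha> | '_' | '$') <alphanum>*
    return run_machine(ALPHA | {"_", "$"}, ALPHA | DIGITS, text, pos)

print(scan_intconst("123+4"))      # 123
print(scan_ident("$tmp1 := 0"))    # $tmp1
```

Because `<alphanum>` contains only letters and digits, `_` and `$` are accepted in the first position only; `scan_ident("a_b")` stops after `a`.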

Equivalence of REs and FSMs

◆ For each regular expression (RE), there is an FSM that recognizes strings conforming to the regular expression
◆ Consider the three basic RE operations
◆ Catenation: a b

[Figure: FSM for a b. An edge labeled a leaves state start, and an edge labeled b enters state done.]

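Equivalence of REs and FSMs

◆ Alternation: a | b | c

[Figure: FSM for a | b | c. States start and done, joined by edges labeled a, b, and c.]

◆ Repetition: a*

[Figure: FSM for a*. States start and done, with edges labeled a, a, U-{a}, and ε.]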

A Sample Regular Language

<comment> ::= ‘{’ (~‘}’)* ‘}’
<letter> ::= ‘a’ | … | ‘z’ | ‘A’ | … | ‘Z’
<digit> ::= ‘0’ | … | ‘9’
<ident> ::= <letter> (<letter> | <digit>)*
<numconst> ::= <digit>+ [ ‘.’ <digit>+ ]
<strconst> ::= ‘“’ (~‘”’)* ‘”’
<assignop> ::= ‘:=’ | ‘:+=’ | ‘:-=’ | ‘:*=’ | ‘:/=’
<negop> ::= ‘~’ | ‘~<’ | ‘~>’ | ‘~=’

A Sample FSM for the language

[Figure: FSM for the sample language. States: Leading, Comment, Ident, Lit 1, Lit 2, Lit 3, Lit 4, Assign?, Assign!, Neg, Finish. Edge labels: {, }, letter, digit, “, ”, :, ~, ., + - * /, =, > < =, ; , . [ ], and other.]

Building a Scanner

◆ How does a scanner interact with the parser?
◆ Consider the following:

[Figure: Source Program → Lexical Analyzer → Syntax Analyzer → Parse Tree. The Syntax Analyzer calls get next token() and the Lexical Analyzer returns a token.]

Scanner Actions

◆ As the scanner changes from state to state, it must do something with the characters it scans in order to build the tokens to be returned to the parser calling it
◆ In some cases, it must append the character seen onto a developing token and consume it so the next input character is visible
  ➨ E.g. when scanning characters in an identifier

Scanner Actions

◆ In other cases it must preserve the character and return a completed token
  ➨ E.g. MaxVal := -999;
    ● After scanning the ‘:’ we know that we have found the end of the identifier ‘MaxVal’ so we want to return that to the parser but we do not want to lose the ‘:’ so we must preserve it
◆ Another possible action is to simply consume a character
  ➨ E.g. characters in comments

Implementing the FSM

◆ A finite state machine may be easily implemented using a table driven technique
◆ Table driven techniques are highly methodical
  ➨ Comparatively easy to handle changes and/or extensions to the grammar
  ➨ Straightforward code that is not error-prone
  ➨ Easy to maintain the code

Implementing the FSM

◆ Regard the scanner as a device which takes a character stream as input and produces a token stream as output.
◆ At any given point in time...
  ➨ The device is in a specific state
  ➨ Based on the current state and the next input character, it will
    ● perform a specific action, and
    ● move into a new (possibly different) state

Scanner Actions - detail

◆ Typical actions include:
    ● C : Consume
    ● AC : Append and Consume
    ● PI : Preserve and build ID token
    ● PL : Preserve and build Literal token
    ● PK : Preserve and build Keyword token
    ● PP : Preserve and build Punctuation token
    ● CO : Consume and build Operator token
    ● CL : Consume and build Literal token
  ➨ What actions you need depends on the

Sample FSM with actions

[Figure: the sample FSM (states Leading, Comment, Ident, Lit 1, Lit 2, Lit 3, Lit 4, Assign?, Assign!, Neg, Finish) with an action on each edge. Edge label / action pairs: } C; { C; other C; letter AC; digit AC; “ AC; : AC; ~ AC; digit AC; other PL; other PI; letter, digit AC; . AC; digit AC; digit AC; other PL; + - * / AC; = CO; = CO; other PP; other PO; > < = CO; other AC; ” CL; ; , . [ ] CP.]

A scanner mainline

STATIC GLOBAL ipchar;
GLOBAL str, token, preserve

str = “ ”
state = Leading
WHILE (state <> Finish) DO
    preserve = NO
    CALL action[state,ipchar]
    state = nextstate[state,ipchar]
    IF NOT preserve THEN
        ipchar = getchar()
RETURN(token)

Action table

Current        Input Character
State          <alpha>  <digit>  .    “    +    :    =    {    etc.
 1. Leading    AC       AC       CP   AC   CO   AC   CO   C
 2. Comment    C        C        C
 3. Ident      AC       AC       PI   PI   PI   PI   PI   PI
 4. Lit 1      PL       AC
 5. Lit 2
 6. Lit 3
 7. Lit 4
 8. Assign?
 9. Assign!
10. Neg
11. Finish
etc.

Next State table

Current        Input Character
State          <alpha>  <digit>  .    “    +    :    =    {    etc.
 1. Leading    3        4        11   7    11   8    11   2
 2. Comment
 3. Ident      3        3        11   11   11   11   11   11
 4. Lit 1      11       3        5    11   11   11   11   11
 5. Lit 2
 6. Lit 3
 7. Lit 4
 8. Assign?
 9. Assign!
10. Neg
11. Finish
etc.

Additional code

◆ All we have to do now is add action routines
  ➨ “append” adds the current character onto a string representing the token being recognized
  ➨ “consume” vs. “preserve” is handled by the preserve flag
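A table-driven scanner along these lines can be sketched in Python. This is an illustration under several assumptions, not the course's implementation: it covers only a few states (Leading, Ident, Lit 1, Assign?, Finish) and a handful of character classes, it merges the action table and the next-state table into one table of (action, next state) pairs, and it adds a hypothetical `CE` action that builds an end-of-input token so the loop terminates. Characters with no table entry are simply consumed in the Leading state.

```python
import string

ALPHA, DIGITS = set(string.ascii_letters), set(string.digits)
LEADING, IDENT, LIT1, ASSIGN, FINISH = range(5)

def char_class(ch):
    """Map a character to a table column ("" stands for end of input)."""
    if ch == "":
        return "eof"
    if ch in ALPHA:
        return "alpha"
    if ch in DIGITS:
        return "digit"
    if ch in {":", "=", "+", ";"}:
        return ch
    return "other"

# (action, next state) per state and character class.
TABLE = {
    LEADING: {"alpha": ("AC", IDENT), "digit": ("AC", LIT1),
              ":": ("AC", ASSIGN), "=": ("CO", FINISH), "+": ("CO", FINISH),
              ";": ("CP", FINISH), "eof": ("CE", FINISH)},
    IDENT:   {"alpha": ("AC", IDENT), "digit": ("AC", IDENT)},
    LIT1:    {"digit": ("AC", LIT1)},
    ASSIGN:  {"=": ("CO", FINISH)},
}
# Entry used for every character class not listed above.
DEFAULT = {LEADING: ("C", LEADING), IDENT: ("PI", FINISH),
           LIT1: ("PL", FINISH), ASSIGN: ("PP", FINISH)}
# Token kind built by each token-building action.
BUILDS = {"PI": "ident", "PL": "literal", "PP": "punct",
          "CO": "operator", "CP": "punct", "CE": "eof"}

class Scanner:
    def __init__(self, text):
        self.text, self.pos = text, 0
        self.ipchar = self._getchar()      # plays the role of the static ipchar

    def _getchar(self):
        ch = self.text[self.pos] if self.pos < len(self.text) else ""
        self.pos += 1
        return ch

    def next_token(self):
        lexeme, token, state = "", None, LEADING
        while state != FINISH:
            action, next_state = TABLE[state].get(
                char_class(self.ipchar), DEFAULT[state])
            preserve = action.startswith("P")   # P* actions keep ipchar
            if action in ("AC", "CO", "CP"):
                lexeme += self.ipchar           # "append"
            if action in BUILDS:
                token = (BUILDS[action], lexeme)
            state = next_state
            if not preserve:
                self.ipchar = self._getchar()   # "consume"
        return token

def scan_all(text):
    """Collect tokens until the end-of-input token is returned."""
    scanner, out = Scanner(text), []
    while True:
        tok = scanner.next_token()
        if tok[0] == "eof":
            return out
        out.append(tok)

print(scan_all("MaxVal := 999 + x1;"))
```

Keeping the lookahead character in the scanner object between calls is what lets a preserved character, such as the `;` that ends `x1`, start the next token.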

A Lexical Analyzer Generator

◆ Building a scanner manually (even using the FSM technique) is tedious
◆ We know that the mapping from regular expressions to FSM is straightforward so why don’t we automate the process?
◆ Then we just type in regular expressions and get back code to implement a scanner
◆ That is exactly what ‘lex’ does

How lex works

[Figure: Lex Source Program (lex.l) → Lex Compiler → lex.yy.c; lex.yy.c → C Compiler → a.out; input stream → a.out → sequence of tokens]

lex Specifications

◆ lex programs are divided into three components

declarations - variables defined, include files specified, etc.
%%
translation rules
    pattern        action
    (using REs)    { C/C++ statements }
%%
auxiliary procedures -- support routines for the C/C++ statements above

Sample lex program

%{
/*
 * this sample demonstrates (very) simple recognition:
 * a verb/not a verb.
 */

/* include's and define's should go in this section */
%}
%%
[\t ]+    /* ignore white space */ ;

is |
am |
are |
were |
was |
be |
being |
been |
do |
does |
did |
have |
had |
go        { printf("%s: is a verb\n", yytext); }

[a-zA-Z]+ { printf("%s: is not a verb\n", yytext); }

.|\n      { ECHO; /* normal default anyway */ }
%%

int main(void)
{
    yylex();
    return 0;
}
