6fdqqlqj 2xwolqh - university of manitobamaheswar/cs329/lectures/lexical.pdf · 6fdqqlqj 2xwolqh 2....
TRANSCRIPT
1
6FDQQLQJ 2XWOLQH
◆ 2. Scanning➨ The basics
➨ Ad-hoc scanning➨ FSM based techniques➨ A Lexical Analysis tool - Lex (a scanner
generator)
5HFDOO� &RPSLOHU 6WUXFWXUH
/H[LFDO $QDO\VLV�6FDQQLQJ
6\QWD[ $QDO\VLV�3DUVLQJ
6HPDQWLF $QDO\VLV
0DFKLQH ,QGHSHQGHQW
2SWLPL]DWLRQ
&RGH *HQHUDWLRQ
0DFKLQH 'HSHQGHQW
2SWLPL]DWLRQ
6RXUFH &RGH
0DFKLQH &RGH
%DFN (QG
)URQW (QG
2
&RPSLOHU 6WUXFWXUH � $QRWKHU 9LHZ
/H[LFDO $QDO\]HU
6\QWD[ $QDO\]HU
&RGH *HQHUDWRU
2XWSXW /DQJXDJH
3KUDVH 6WUXFWXUH
/H[HPHV RU 7RNHQV
,QSXW /DQJXDJH
/H[LFDO $QDO\VLV � :KDW LV LW"
◆ The input to a compiler/interpreter is a sourceprogram which is “structured” as asequence/stream of characters
➨ or rather unstructured◆ Processing individual characters is pretty tedious
and highly inefficient
◆ As such, the first thing we have to do is addsome basic “structure” to the source code
3
/H[LFDO $QDO\VLV � :KDW LV LW"
◆ A Lexical Analyzer (a.k.a. scanner) convertsa stream of characters into a stream oftokens➨ i.e. they “tokenize” the input
◆ This is a many:1 transformation and thuslater phases of compilation will only need todeal with comparatively few tokens.
◆ A token (a.k.a. lexeme or syntactic unit) is afundamental component of a program
/H[LFDO $QDO\VLV �7RNHQV
◆ Tokens are typically the bottom level entitiesin syntax diagrams
◆ Typical tokens include:
➨ identifiers (e.g. variable names, etc.)➨ keywords➨ operators➨ literals (i.e. constant values)➨ punctuation
◆ Consider a simple program and its tokens:
4
/H[LFDO $QDO\VLV �7RNHQV
352*5$0 WHVW FU�OI
9$5 [ � ,17(*(5 � FU�OI
%(*,1 FU�OI
[ � [ � � � FU�OI
(1' � ^ WHVW `
NH\ZRUG�352*5$0� LGHQW�WHVW� NH\ZRUG�9$5� LGHQW�[� SXQFW���
NH\ZRUG�,17(*(5� SXQFW��� NH\ZRUG�%(*,1� LGHQW�[�
RSHUDWRU�� � LGHQW�[� RSHUDWRU��� OLWHUDO��� SXQFW���
NH\ZRUG�(1'� SXQFW���
6RXUFH
&RGH
7RNHQV
2WKHU 6FDQQHU )XQFWLRQV
◆ A scanner also removes white space from aprogram
◆ white space consists of spaces, tabs,carriage returns, comments, and the like➨ stuff put into the source code solely for
readability which does not affect thefunctional specification provided by theprogram
◆ Some scanners also enter symbols in thesymbol table (more later)
5
$G KRF 6FDQQLQJ
◆ There are many applications outside of compilerconstruction that require simple scanningfunctions
➨ e.g. recognizing numeric values in financialand other applications
◆ These applications either implement their ownrecognition functions or rely on library routinesor language based pattern matching to providethe needed functionality
$G KRF 6FDQQLQJ
◆ Manual recognition of tokens involves amultitude of IF , WHILE, and SWITCH statements
➨ This approach is ugly, extremely tedious,highly error prone, and difficult to understand,maintain, and extend
◆ Using existing routines for doing patternmatching is a significant improvement
6
$G KRF 6FDQQLQJ
◆ In many cases (e.g. C language) thesefacilities are provided by library routines➨ #include <string.h>: index, strlen, strcat,
etc.
◆ In other cases (e.g. some variants of Pascal)they are incorporated into the language➨ substring functions, sets, etc.➨ or consider the language Perl!!!
◆ Both these reflect the prevalence andimportance of such functionality
$G KRF 6FDQQLQJ
◆ Anyone who has had to do a significant amountof such scanning/pattern matching knows howawkward it is
➨ e.g. consider data verification or theprocessing of command line arguments asother examples
◆ Scanning in a compiler/interpreter is typically farworse➨ Even simple languages have complex
lexemes
7
*UDPPDUV
◆ A “generative grammar” is a set of rules togenerate valid phrases in a particular language
◆ Grammar G = {V, T, P, S}; V - finite set of non-terminals or variables, T -finite set of terminals ortokens, P - finite set of productions, S - is a non-terminal called “start symbol”
◆ Noam Chomsky defined classes of “complexity”of generative grammars
◆ The hierarchy of four classes, each of whichproperly contains the next is called the Chomskyhierarchy
*UDPPDUV � &KRPVN\ KLHUDUFK\
Type 0Unrestricted Grammars
Type 1Context-Sensitive Grammars (CSGs)
Type 2Context-Free Grammars (CFGs)
Type 3Regular Grammars (RGs)
8
8QUHVWULFWHG *UDPPDUV
◆ This type of grammar is too complex forprogramming languages -- cannot constructefficient parsers for this type of grammar
◆ This grammar consists of productions of theform α →β
&RQWH[W�6HQVLWLYH *UDPPDUV
◆ Most computer languages fall into this class ofgrammars
◆ The productions in this class are of the formα1Αα2 →α1βα2
◆ “A becomes β in the context of α1 and α2” -- ingeneral these grammars are still too complex forefficient computer analysis
◆ The context sensitivity of the programminglanguages is handled by other means so thatcontext free grammars can be used forprogramming languages
9
&RQWH[W )UHH *UDPPDUV
◆ A production of a context free grammar (CFG) isof the form Α→α, where Α is a variable and α isa string of symbols
◆ In CFGs, the derivations are on variables areindependent of what surrounds them
◆ To generate phrases in the language, strings ofterminals are derived by repeated expansion ofnon-terminals
◆ CFGs permit the construction of efficient syntaxanalyzers
&RQWH[W )UHH *UDPPDUV
◆ Example:
<S> → a <A> b
<A> → <B> c
<B> → d
◆ Language generated by the above grammar isadcb
Productions of the grammar
10
5HJXODU *UDPPDUV
◆ If all the productions of a CFG are of the form
➨ Α→ωΒ or Α→ω, where Α, Β are non-terminalsand ω is a string of terminals (possibly empty)
➨ Α→Βω or Α→ω, where Α, Β are non-terminalsand ω is a string of terminals (possibly empty)
◆ Then the grammar is a RG -- first form is called“Right linear” and the second form is called “Leftlinear”
◆ RGs are too restrictive for most purposes
◆ Very efficient parsers can be built
5HJXODU *UDPPDUV
◆ The reason for the efficiency is that the languagegeneration from RG can be performed withoutremembering our current position in theproduction that is currently being expanded
◆ Lack of memory makes RGs incapable ofgenerating languages with arbitrarily nestedstructures
◆ In compilers, RGs will be used to describe“words” and CFGs will be used to describephrases constructed from these words
11
5HJXODU ([SUHVVLRQV � 5(V
◆ Regular expressions are a simplified form ofgrammar used to represent RGs
➨ ε (epsilon - empty set) is a regular expressionthat matches nothing
➨ symbol (terminal) s in the language is a REthat matches s
➨ if R is a RE, (R)* matches zero or moreoccurrences of the pattern R - known as theclosure of R
➨ if R is a RE, (R)+ matches one or moreoccurrences of the pattern R
5HJXODU ([SUHVVLRQV � 5(V
➨ If R and S are RE, (R)|(S) matches either thepattern R or the pattern S -- alternation
➨ If R and S are RE, (R)(S) matches thecatenation of pattern R followed by pattern S
◆ Example
◆ <int> ::= (0|1|2|3|4|5|6|7|8|9)+
◆ <int_no_leading_zero> ::=
(1|2|3|4|5|6|7|8|9) (0|1|2|3|4|5|6|7|8|9)*
12
%HWWHU 6FDQQLQJ 7HFKQLTXHV
◆ This has motivated the development of bothtechniques and tools for doing scanning
◆ The most common of these are based onwhat are known as finite state machines(FSMs) which recognize regular languages
◆ The key to being able to do this is theexistence of certain restrictions placed on theformat of programming languages➨ E.g.; tokens are usually separated by
delimiters
)60�EDVHG 6FDQQLQJ
◆ The most common techniques used forbuilding scanners are based on finite statemachines(or FSMs)
◆ FSMs can be easily used to recognizelanguage constructs (tokens) which aredescribed by regular languages
13
5HJXODU /DQJXDJHV �5HYLVWHG
◆ A regular language is one which is composed ofregular expressions
◆ A regular expression consists of simple, atomicelements combined using only three operations➨ catenation,➨ alternation, and➨ repetition
5HJXODU /DQJXDJHV �5HYLVWHG
◆ Catenation (a.k.a. concatenation or sequencing)is represented by physical adjacency➨ e.g. the regular expression ‘<letter> <digit>’
simply represents (depending on the definitionof letter and digit) a sequence composed of aletter followed by a digit
● we would use the “::=” (equivalence)operator to associated a definition with<letter> or <digit>
14
5HJXODU /DQJXDJHV �5HYLVWHG
◆ Alternation allows selection from a number ofchoices and is commonly represented by the ‘|’operator
➨ E.g. <digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
◆ Certain shorthand forms are also commonlyused with alternation (especially ellipses)➨ E.g. <alpha> ::= a | b | … | z | A | B | … | Z
5HJXODU /DQJXDJHV �5HYLVWHG
◆ Finally, repetition permits the expression ofconstructs which are to be repeated somenumber of times
◆ There are two operators used for this purpose:superscript ‘+’, and superscript ‘*’➨ E.g. <word> ::= <letter>+
● this implies 1 or more letters (* would imply0 or more letters)
15
5HJXODU /DQJXDJHV �5HYLVWHG
◆ Finally, parenthesis ( ‘(’ and ‘)’) are used forgrouping regular expressions
◆ Normally, the repetition operators have thehighest precedence followed by alternation andthen followed by catenation
◆ These 3 simple operations permit us to easilyexpress the tokens that occur in existingprogramming languages
5HJXODU /DQJXDJHV �5HYLVWHG
◆ Consider the following regular expressions for afew common tokens and token types we mightencounter
➨ <assignop> ::= ‘:=’➨ <alphanum> ::= <alpha> | <digit>➨ <ident> ::= (<alpha> | ‘_’ | ‘$’) <alphanum>*
➨ <intconst> ::= <digit>+
◆ Not everything is this simple to specify
1RWH WKH XVH RI TXRWHV
16
5HJXODU /DQJXDJHV �5HYLVWHG
◆ For this reason, there are a couple of other“short cuts” that make life easier
◆ These are notational conveniences only andcan easily be represented using the basicconstructs
◆ Logical Negation (‘^’ or ‘~’)➨ commonly used with other constructs➨ <comment> ::= ‘{‘ (~’}’)* ‘}’
5HJXODU /DQJXDJHV �5HYLVWHG
➨ ~a implies anything in U-{a}➨ Negation can be done simply by enumerating
everything in U-{a}
● e.g. if U={a b c d e} then we could write(~a)* or, alternatively, (b | c | d | e)*
◆ Optional Constructs➨ sometime it becomes tedious to list a number
of similar options which could be moreconveniently expressed by saying someconstructs are optional
17
5HJXODU /DQJXDJHV �5HYLVWHG
➨ The most common notation for an optionalconstruct is the use of braces
● E.g. <signedintconst> ::= [+ | -]<intconst>
➨ The preceding example is equivalent to thefollowing:
● <signedintconst> ::= <intconst> | ‘+’<intconst> | ‘-’ <intconst>
➨ If we could specify the number of times arepetition could take place we could do itanother way too
5HJXODU /DQJXDJHV �5HYLVWHG
◆ Consider:
➨ <signedintconst> ::= (+ | -)0..1 <intconst>
● The 0..1 is intended to imply that repetitioncan take place at most once (0 or 1 times)
◆ This illustrates yet another possible constructwhich, like the others, may be expressed usingonly catenation, alternation, and replication
➨ albeit more verbosely
18
5HJXODU /DQJXDJHV �5HYLVWHG
◆ Let’s try something a bit more challenging:
◆ What does a real constant look like?
➨ It might have a sign for the mantissa➨ The mantissa consists of some digits followed
by a decimal point possibly followed by somemore digits (the fractional part)
➨ There might be an exponent as well whichcould be signed
5HJXODU /DQJXDJHV �5HYLVWHG
◆ Let’s do this in pieces...➨ <realconst> ::= <mantissa> [ ‘E’ <exponent>]
◆ Consider the exponent first - its just a signedinteger constant:➨ <exponent> ::= [+ | -] <intconst>
◆ where➨ <intconst> ::= <digit>+
➨ <digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
19
5HJXODU /DQJXDJHV �5HYLVWHG
◆ Now let’s try the mantissa…➨ <mantissa> ::= [+ | -] <intconst> ‘.’ [
intconst]
◆ As with programming, divide and conquerworks well to handle the complexity of regularexpression specification
◆ Also, the use of the “optional” constructsgreatly simplifies this specification➨ As an exercise, try doing the real constant
without ‘[’ and ‘]’
5HJXODU /DQJXDJHV )60V
◆ A good way to start developing a scanner is toproduce regular expressions for the tokens youwish to recognize
◆ The regular expressions themselves, however,are not the basis of the scanning process
◆ This requires a Finite State Machine (FSM)specification
20
)LQLWH 6WDWH 0DFKLQHV
◆ Fortunately, there is a direct 1:1 mappingbetween regular expressions and the FSMsthat “implement” them
◆ An FSM is an abstract machine which can bein one of a finite number of states, whichmakes state transitions based on inputs, andwhich performs specific actions in specificstates or on transitions between states➨ Moore and Mealy machines from digital
logic
)LQLWH 6WDWH 0DFKLQHV
◆ FSMs are commonly represented graphically➨ Nodes in the graph represent individual states
and are assigned meaningful names
➨ Edges represent transitions between thestates and are labeled with the input valueswhich cause the state transitions
◆ An FSM-based scanner takes its input from thesource code character stream
21
)LQLWH 6WDWH 0DFKLQHV
◆ The FSM-based scanner performs certainactions which include recognizing specificcharacters, accumulating the characters in aparticular token, and returning completed tokensto form the output token stream
◆ We’ll begin by just recognizing some simpletokens and worry about actually building thetokens later
)LQLWH 6WDWH 0DFKLQHV
◆ <digit> ::= 0 | 1 | … | 9
◆ <intconst> ::= <digit>+
intconst
0..9
other0..9
22
)LQLWH 6WDWH 0DFKLQHV
◆ <ident> ::= (<alpha> | ‘_’ | ‘$’) <alphanum>*
ident
alphanum
otheralpha,_,$
(TXLYDOHQFH RI 5(V DQG )60V
◆ For each regular expression (RE), there is anFSM that recognizes strings conforming to theregular expression
◆ Consider the three basic RE operations
◆ Catenation: a b
a bstart done
23
(TXLYDOHQFH RI 5(V DQG )60V
◆ Alternation: a | b | c
◆ Repetition: a*
start done
a
b
c
start donea
a
U-{a}
ε
$ 6DPSOH 5HJXODU /DQJXDJH
<comment> ::= ‘{‘ (~’}’)* ‘}’<letter> ::= ‘a’ | … | ‘z’ | ‘A’ | … | ‘Z’<digit> ::= ‘0’ | … | ‘9’<ident> ::= <letter> (<letter> | digit>)*
<numconst> ::= <digit>+ [ ‘.’ <digit>+ ]<strconst> ::= ‘ “ ’ (~’ ” ’)* ‘ ” ’<assignop> ::= ‘:=’ | ‘:+=’ | ‘:-=’ | ‘:*=’ | ‘:/=’<negop> ::= ‘~’ | ‘~<’ | ‘~>’ | ‘~=’
24
$ 6DPSOH )60 IRU WKH ODQJXDJH
CommentLeading
Lit 1
Assign?
Assign!
Neg
Finish
Lit 3
Lit 2
Lit 4
Ident
}
{ other
letter
digit
“
:
~
digitother
otherletter, digit
.
digit
digit
other
+-*/==
other
other
><=
other
”
; , . [ ]
%XLOGLQJ D 6FDQQHU
◆ How does a scanner interact with the parser?
◆ Consider the following:
LexicalAnalyzer
SyntaxAnalyzer
SourceProgram
token
get nexttoken()
ParseTree
25
6FDQQHU $FWLRQV
◆ As the scanner changes from state to state, itmust do something with the characters it scansin order to build the tokens to be returned to theparser calling it
◆ In some cases, it must append the characterseen onto a developing token and consume it sothe next input character is visible➨ E.g. when scanning characters in an identifier
6FDQQHU $FWLRQV
◆ In other cases it must preserve the characterand return a completed token➨ E.g. MaxVal := -999;
● After scanning the ‘:’ we know that wehave found the end of the identifier‘MaxVal’ so we want to return that to theparser but we do not want to lose the ‘:’so we must preserve it
◆ Another possible action is to simply consumea character
➨ E.g. characters in comments
26
,PSOHPHQWLQJ WKH )60
◆ A finite state machine may be easilyimplemented using a table driven technique
◆ Table driven techniques are highly methodical
➨ Comparatively easy to handle changes and/orextensions to the grammar
➨ Straightforward code that is not error-prone➨ Easy to maintain the code
,PSOHPHQWLQJ WKH )60
◆ Regard the scanner as a device which takes acharacter stream as input and produces a tokenstream as output.
◆ At any given point in time...➨ The device is in a specific state➨ Based on the current state and the next input
character, it will● perform a specific action, and● move into a new (possibly different) state
27
6FDQQHU $FWLRQV � GHWDLO◆ Typical actions include:
● C : Consume
● AC : Append and Consume● PI : Preserve and build ID token● PL: Preserve and build Literal token● PK : Preserve and build Keyword token● PP : Preserve and build Punctuation
token● CO : Consume and build Operator token● CL : Consume and build Literal token
➨ What actions you need depends on the
6DPSOH )60 ZLWK DFWLRQV
CommentLeading
Lit 1
Assign?
Assign!
Neg
Finish
Lit 3
Lit 2
Lit 4
Ident
} C
{ C
other C
letter AC
digit AC
“AC
: AC
~AC
digit AC other
PL
other PIletter, digit
AC
. AC
digit AC
digit AC
other PL
+-*/
AC
= CO= CO
other PP
other PO
><= CO
other AC ” CL
; , . [ ] CP
28
$ VFDQQHU PDLQOLQH
STATIC GLOBAL ipchar;
GLOBAL str, token, preserve
str = “ ”
state = Leading
WHILE (state <> Finish) DO
preserve = NO
CALL action[state,ipchar]
state = nextstate[state,ipchar]
IF NOT preserve THEN
ipchar = getchar()
RETURN(token)
$FWLRQ WDEOH
Current Input Character State <alpha> <digit> . “ + : = { etc.
1. Leading AC AC CP AC CO AC CO C 2. Comment C C C 3. Ident AC AC PI PI PI PI PI PI 4. Lit 1 PL AC 5. Lit 2 6. Lit 3 7. Lit 4 etc. 8. Assign? 9. Assign!10. Neg11. Finish
etc.
29
1H[W 6WDWH WDEOH
Current Input Character State <alpha> <digit> . “ + : = { etc.
1. Leading 3 4 11 7 11 8 11 2 2. Comment 3. Ident 3 3 11 11 11 11 11 11 4. Lit 1 11 3 5 11 11 11 11 11 5. Lit 2 6. Lit 3 7. Lit 4 etc. 8. Assign? 9. Assign!10. Neg11. Finish
etc.
$GGLWLRQDO FRGH
◆ All we have to do now is add action routines➨ “append” adds the current character onto a
string representing the token beingrecognized
➨ “consume” vs. “preserve” is handled by thepreserve flag
30
$ /H[LFDO $QDO\]HU *HQHUDWRU
◆ Building a scanner manually (even using theFSM technique) is tedious
◆ We know that the mapping from regularexpressions to FSM is straightforward so whydon’t we automate the process?
◆ Then we just type in regular expressions and getback code to implement a scanner
◆ That is exactly what ‘lex ’ does
+RZ lex ZRUNV
LexCompiler
CCompiler
a.out
LexSource
Programlex.l
lex.yy.c
inputstream
lex.yy.c
a.out
sequenceof
tokens
31
OH[ 6SHFLILFDWLRQV
◆ lex programs are divided into three componentsdeclarations - variable defined,
include files specified, etc
%%
translation rules
pattern action
(using REs) { C/C++ statements}
%%
auxiliary procedures -- supportroutines for the C/C++ statementsabove
6DPSOH lex SURJUDP%{/* * this sample demonstrates (very) simple
recognition: * a verb/not a verb. */
/* include’s and define’s should go in this section*/
%}%%
32
6DPSOH lex SURJUDP[\t ]+ /* ignore white space */ ;
is |am |are |were |was |be |being |been |do |does |did |have |had |go { printf("%s: is a verb\n", yytext); }
6DPSOH lex SURJUDP[a-zA-Z]+ { printf("%s: is not a verb\n", yytext);
}
.|\n { ECHO; /* normal default anyway */ }%%
main(){
yylex();}