Chapter 2. Regular Expressions and Automata
2.1 Regular Expressions
March 30, 2007
Kim Min-ho, Artificial Intelligence Laboratory, Pusan National University
Text: Speech and Language Processing, pp. 21-33
Outline
- Introduction
- Basic Regular Expression Patterns
- Disjunction, Grouping, and Precedence
- A Simple Example
- A More Complex Example
- Advanced Operators
- Regular Expression Substitution, Memory, and ELIZA
Introduction
regular expression
- one of the unsung successes in standardization in computer science
- a language for specifying text search strings
- an algebraic notation for characterizing a set of strings
regular expression search
- requires a pattern that we want to search for
- a search function will search through the corpus, returning all texts that contain the pattern
Basic Regular Expression Patterns (1/6)
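metacharacter: the slash /, which delimits a regular expression (e.g., /[1-9]/)
metacharacter: the square brackets [ ], which match any one of the characters listed inside them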
Basic Regular Expression Patterns (2/6)
metacharacter: the dash -, which specifies a range inside square brackets
- /[123456789]/ = /[1-9]/
- /[ABCDEFGHIJKLMNOPQRSTUVWXYZ]/ = /[A-Z]/
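The equivalence between a spelled-out bracket list and a dash range can be checked with a minimal sketch in Python's `re` module. The sample sentence is an assumption made for illustration and is not taken from the slides.

```python
import re

# Illustrative sample text (an assumption, not from the slides).
text = "Chapter 2 covers Regular Expressions in 2007"

# [1-9] is shorthand for [123456789]: any single digit except 0.
digits_long = re.findall(r"[123456789]", text)
digits_range = re.findall(r"[1-9]", text)

# [A-Z] is shorthand for listing every uppercase letter.
capitals = re.findall(r"[A-Z]", text)

print(digits_long == digits_range)  # True
print(digits_range)                 # ['2', '2', '7']
print(capitals)                     # ['C', 'R', 'E']
```

Note that the two zeros in 2007 are skipped, because the range starts at 1.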
Basic Regular Expression Patterns (3/6)
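metacharacter: the caret ^, which negates a bracket expression when it is the first symbol after the open bracket (e.g., /[^a-zA-Z]/ matches any single character that is not a letter)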
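As a small sketch of the caret inside square brackets, assuming Python's `re` module and two made-up sample strings: a leading caret negates the set, while a caret elsewhere in the brackets is an ordinary character.

```python
import re

# Illustrative sample text (an assumption, not from the slides).
text = "Regex 101!"

# [^a-z]: any single character that is NOT a lowercase letter.
not_lower = re.findall(r"[^a-z]", text)

# When the caret is not the first symbol, it stands for itself.
literal = re.findall(r"[e^]", "e^x")

print(not_lower)  # ['R', ' ', '1', '0', '1', '!']
print(literal)    # ['e', '^']
```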
Basic Regular Expression Patterns (4/6)
metacharacter: the question mark ?, which makes the preceding character optional
Kleene *
zero or more occurrences of the immediately previous character or regular expression
- /a*/ means 'any string of zero or more as'
- /aa*/ means 'one or more as'
- /[ab]*/ means 'zero or more as or bs'
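The three Kleene star patterns and the optional marker can be tried out as follows. This is a sketch assuming Python's `re` module; the helper `matches` and the test strings (including the `colou?r` example) are hypothetical names and data chosen for illustration.

```python
import re

def matches(pattern, s):
    """Return True only if the WHOLE string s fits the pattern."""
    return re.fullmatch(pattern, s) is not None

# /a*/: zero or more a's, so the empty string is accepted too.
star = [matches(r"a*", s) for s in ["", "a", "aaa", "b"]]

# /aa*/: one a followed by zero or more a's, i.e. one or more a's.
one_or_more = [matches(r"aa*", s) for s in ["", "a", "aaa"]]

# /[ab]*/: zero or more characters, each of which is an a or a b.
mixed = [matches(r"[ab]*", s) for s in ["", "abba", "abc"]]

# ?: the preceding character may appear once or not at all.
optional = [matches(r"colou?r", s) for s in ["color", "colour", "colouur"]]

print(star)         # [True, True, True, False]
print(one_or_more)  # [False, True, True]
print(mixed)        # [True, True, False]
print(optional)     # [True, True, False]
```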
Basic Regular Expression Patterns (5/6)
Kleene +
one or more of the previous character
- /baaa*!/ = /baa+!/
metacharacter: the period . (wildcard expression)
/beg.n/
- any character between beg and n
- begin, beg'n, begun
/.*/
- any string of characters
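A brief sketch, assuming Python's `re` module, of why /baaa*!/ and /baa+!/ accept the same strings and of how the wildcard in /beg.n/ behaves. The word lists are illustrative assumptions.

```python
import re

# Both patterns require b, at least two a's, then an exclamation mark.
sheep = ["baa!", "baaaa!", "ba!"]
with_star = [bool(re.fullmatch(r"baaa*!", s)) for s in sheep]
with_plus = [bool(re.fullmatch(r"baa+!", s)) for s in sheep]

# The period stands for exactly one character of any kind.
words = ["begin", "beg'n", "begun", "begn", "beggin"]
wild = [w for w in words if re.fullmatch(r"beg.n", w)]

print(with_star == with_plus)  # True
print(with_plus)               # [True, True, False]
print(wild)                    # ['begin', "beg'n", 'begun']
```

The last two words are rejected because the period must consume exactly one character: begn has none between beg and n, and beggin has two.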
Basic Regular Expression Patterns (6/6)
Anchors
the caret ^ matches the start of a line
the dollar sign $ matches the end of a line
- /^The dog\.$/ matches a line that contains only the phrase The dog.
\b matches a word boundary
- /the/ vs. /\bthe\b/: /the/ also matches inside there; /\bthe\b/ does not
/^$/
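The effect of the anchors and of the word boundary can be seen in a short sketch, assuming Python's `re` module. The sample lines and the sample sentence are assumptions made for illustration.

```python
import re

# ^ and $ pin the pattern to the start and the end of the line.
lines = ["The dog.", "The dog. It barked.", "See The dog."]
only_phrase = [ln for ln in lines if re.search(r"^The dog\.$", ln)]

# Without \b, "the" is also found inside "there" and "other".
text = "there is the other one"
loose = re.findall(r"the", text)
strict = re.findall(r"\bthe\b", text)

print(only_phrase)  # ['The dog.']
print(len(loose))   # 3
print(len(strict))  # 1
```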
Disjunction, Grouping, and Precedence
We can't use [] to search for "cat or dog"
metacharacter: the pipe symbol |
- /cat|dog/ matches either the string cat or the string dog
How can I specify both guppy and guppies?
- /guppy|ies/ does not work: sequences like guppy take precedence over |, so it matches guppy or ies
- /gupp(y|ies)/ uses parentheses so that the disjunction applies only to the suffix
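The precedence problem with guppy and guppies can be reproduced in a few lines. This is a sketch assuming Python's `re` module; the word list and the sample sentence are illustrative assumptions.

```python
import re

words = ["guppy", "guppies", "ies", "guppyies"]

# Sequence binds tighter than |, so this reads as (guppy) or (ies).
wrong = [w for w in words if re.fullmatch(r"guppy|ies", w)]

# Parentheses restrict the disjunction to the suffix.
right = [w for w in words if re.fullmatch(r"gupp(y|ies)", w)]

# Plain disjunction between two whole strings.
pets = re.findall(r"cat|dog", "my cat chased the dog")

print(wrong)  # ['guppy', 'ies']
print(right)  # ['guppy', 'guppies']
print(pets)   # ['cat', 'dog']
```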
Disjunction, Grouping, and Precedence
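operator precedence hierarchy (highest to lowest)
- parentheses: ()
- counters: * + ? {}
- sequences and anchors: the ^my end$
- disjunction: |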
A Simple Example
to write a RE to find cases of the English article the
- /the/
this pattern will miss the word when it begins a sentence and hence is capitalized (i.e., The)
- /[tT]he/
this pattern still matches the embedded in other words (e.g., other or theology)
- /\b[tT]he\b/
- /[^a-zA-Z][tT]he[^a-zA-Z]/
- /(^|[^a-zA-Z])[tT]he[^a-zA-Z]/
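The successive refinements of the pattern for the can be compared by counting matches on one sentence. This is a sketch assuming Python's `re` module; the sentence and the helper `count` are hypothetical and chosen so that each refinement changes the count.

```python
import re

# Illustrative sentence (an assumption, not from the slides).
text = "The other day the theology student read the book"

def count(pattern):
    """Count non-overlapping matches of pattern in text."""
    return sum(1 for _ in re.finditer(pattern, text))

lowercase_only = count(r"the")          # misses "The", hits "other", "theology"
either_case = count(r"[tT]he")          # adds "The", still hits embedded cases
word_boundary = count(r"\b[tT]he\b")    # whole words only
no_letters = count(r"(^|[^a-zA-Z])[tT]he[^a-zA-Z]")  # same idea without \b

print(lowercase_only)  # 4
print(either_case)     # 5
print(word_boundary)   # 3
print(no_letters)      # 3
```

One design difference is worth noting: the last pattern consumes the neighboring non-letter characters as part of the match, whereas \b matches a position and consumes nothing.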
A More Complex Example (1/2)
"any PC with more than 500 MHz and 32 Gb of disk space for less than $1000"
regular expression for prices (e.g., $999.99)
simple regular expression for prices
- /$[0-9]+/
to deal with fractions of dollars
- /$[0-9]+\.[0-9][0-9]/
- this pattern only allows $199.99 but not $199
- /\b$[0-9]+(\.[0-9][0-9])?\b/
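The price patterns can be tried in Python's `re` module with two adjustments that are assumptions of this sketch rather than part of the slides. First, the dollar sign is written `\$`, because an unescaped `$` is the end-of-line anchor in Python. Second, the leading `\b` is dropped, because there is no word boundary between a space and a dollar sign. The advertisement text is made up, and the optional group is non-capturing only so that `findall` returns whole matches.

```python
import re

# Illustrative advertisement text (an assumption, not from the slides).
ad = "Laptops from $199 or $199.99, desktops for $999.99 only"

# Whole dollars only: the cents are cut off.
whole = re.findall(r"\$[0-9]+", ad)

# Cents required: $199 on its own is no longer found.
cents_required = re.findall(r"\$[0-9]+\.[0-9][0-9]", ad)

# Cents optional: (?: ... )? makes the fractional part optional.
price = re.compile(r"\$[0-9]+(?:\.[0-9][0-9])?\b")
both = price.findall(ad)

print(whole)           # ['$199', '$199', '$999']
print(cents_required)  # ['$199.99', '$999.99']
print(both)            # ['$199', '$199.99', '$999.99']
```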
A More Complex Example (2/2)
regular expression for processor speed
regular expressions for operating systems and vendors
Advanced Operators (1/3)
Aliases for common sets of characters
Advanced Operators (2/3)
Regular expression operators for counting
/a\.{24}z/ matches a followed by 24 dots followed by z
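A quick check of the counting operator, sketched with Python's `re` module. The test strings are built in code and are assumptions made for illustration; the comparison with an unescaped period is added only to show why the backslash matters.

```python
import re

# {24} demands exactly 24 repetitions of the preceding item.
exact = bool(re.fullmatch(r"a\.{24}z", "a" + "." * 24 + "z"))
too_short = bool(re.fullmatch(r"a\.{24}z", "a" + "." * 23 + "z"))

# \. is a literal dot, so 24 x's are rejected ...
literal_dot = bool(re.fullmatch(r"a\.{24}z", "a" + "x" * 24 + "z"))
# ... whereas an unescaped . is the wildcard and accepts them.
wildcard = bool(re.fullmatch(r"a.{24}z", "a" + "x" * 24 + "z"))

print(exact, too_short)       # True False
print(literal_dot, wildcard)  # False True
```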
Advanced Operators (3/3)
Some characters that need to be backslashed
Regular Expression Substitution, Memory, and ELIZA (1/3)
Perl substitution operator
- s/regexp1/regexp2/
- s/colour/color/
number operator (using memory)
changing the 35 boxes to the <35> boxes
- s/([0-9]+)/<\1>/
/the (.*)er they were, the \1er they will be/
- will match The bigger they were, the bigger they will be
- but not The bigger they were, the faster they will be
these numbered memories are called registers
"extended" feature of regular expressions
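The substitution and memory examples can be sketched in Python, where `re.sub` plays the role of Perl's s/// operator (a swapped-in equivalent, not the notation used on the slide). The input strings are illustrative assumptions, and the sentences are written in lowercase because the pattern itself starts with a lowercase the.

```python
import re

# s/colour/color/
american = re.sub(r"colour", "color", "my favourite colour")

# s/([0-9]+)/<\1>/ : \1 in the replacement inserts what group 1 matched.
boxed = re.sub(r"([0-9]+)", r"<\1>", "the 35 boxes")

# \1 inside the pattern must repeat exactly what group 1 matched.
pattern = r"the (.*)er they were, the \1er they will be"
same = bool(re.search(pattern, "the bigger they were, the bigger they will be"))
different = bool(re.search(pattern, "the bigger they were, the faster they will be"))

print(american)         # my favourite color
print(boxed)            # the <35> boxes
print(same, different)  # True False
```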
Regular Expression Substitution, Memory, and ELIZA (2/3)
number operator (cont'd)
/the (.*)er they (.*), the \1er they \2/
- will match The bigger they were, the bigger they were
- but not The bigger they were, the bigger they will be
ELIZA
- simple natural-language understanding program (1966)
- substitution using memory
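The two-register pattern, together with a toy ELIZA-style reply, can be sketched in Python's `re` module. The reply rule below is a hypothetical rule written for this sketch in the spirit of ELIZA; it is not taken from the slides or from the original program, and the input sentences are likewise assumptions.

```python
import re

# \1 and \2 must repeat what the first and second groups matched.
pattern = r"the (.*)er they (.*), the \1er they \2"
ok = bool(re.search(pattern, "the bigger they were, the bigger they were"))
bad = bool(re.search(pattern, "the bigger they were, the bigger they will be"))

# Hypothetical ELIZA-style rule: echo the remembered word in the reply.
rule = r".*\bI am (sad|depressed)\b.*"
reply = re.sub(rule, r"I am sorry to hear you are \1", "Well, I am sad today")

print(ok, bad)  # True False
print(reply)    # I am sorry to hear you are sad
```

Because the rule starts and ends with `.*`, the whole input is replaced by the reply, and only the word held in the register is carried over.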
Regular Expression Substitution, Memory, and ELIZA (3/3)