§20.6. Regular expression matching

When playing around with text, we tend to get into longer and trickier wrangles of matching - we find that we want to look not for simple text like "gold", but for "gold" used only as a separate word, or for a date in YYYY-MM-DD format, or for a seemingly endless range of other possibilities. What we need is not just for Inform to provide a highly flexible matching program, but also a good notation in which to describe what we want.

Fortunately, such a notation already exists. This is the "regular expression" notation, named for a 1950s mathematical model by the logician Stephen Kleene, applied to computing in the late 60s by Ken Thompson, borrowed almost at once by the early Unix tools of the 70s, and developed further by Henry Spencer in the 80s and Philip Hazel in the 90s. The glue holding the Internet together - the Apache web-server, the scripting languages Perl and Python, and so forth - makes indispensable use of regular expressions.

As might be expected from the previous section, we simply have to describe the FIND text as "regular expression" rather than "text" and then the same facilities are available:

if (text) matches the regular expression (text):

This condition is true if any contiguous part of the text can be matched against the given regular expression. Examples:

if "taramasalata" matches the regular expression "a.*l", ...

is true, since this looks for a part of "taramasalata" which begins with "a", continues with any number of characters, and finishes with "l"; so it matches "aramasal". (Not "asal", because it gets the makes the leftmost match it can.) The option "case insensitively" causes lower and upper case letters to be treated as equivalent.

if (text) exactly matches the regular expression (text):

This condition is true if the whole text (starting from the beginning and finishing at the end) can be matched against the given regular expression. The option "case insensitively" causes lower and upper case letters to be treated as equivalent.

And once again:

number of times (text) matches the regular expression (text) ... number

This produces the number of times that contiguous pieces of the text can be matched against the regular expression, without allowing them to overlap.

Since a regular expression can match quite a variety of possibilities (for instance "b\w+t" could match "boast", "boat", "bonnet" and so on), it's sometimes useful to find what the match actually was:

text matching regular expression ... text

This phrase is only meaningful immediately after a successful match of a regular expression against text, and it produces the text which matched. Example:

if "taramasalata" matches the regular expression "m.*l":
    say "[text matching regular expression].";

says "masal."

Perhaps fairly, perhaps not, regular expressions have a reputation for being inscrutable. The basic idea is that although alphanumeric characters (letters, numbers and spaces) mean just what they look like, punctuation characters are commands with sometimes dramatic effects. Thus:

if WHATEVER matches the regular expression "fish", ...
if WHATEVER matches the regular expression "f.*h", ...

behave very differently. The first is just like matching the text "fish", but the second matches on any sequence of characters starting with an "f" and ending with an "h". This is not at all obvious at first sight: reading regular expressions is a skill which must be learned, like reading a musical score. A really complex regular expression can look like a soup of punctuation and even an expert will blink for a few minutes before telling you what it does - but a beginner can pick up the basics very quickly. Newcomers might like to try out and become comfortable with the features a few at a time, reading down the following list.

1. Golden rule. Don't try to remember all the characters with weird effects. Instead, if you actually mean any symbol other than a letter, digit or space to be taken literally, place a backslash "\" in front of it. For instance, matching the regular expression

"\*A\* of the Galactic Patrol"

is the same as matching the text "*A* of the Galactic Patrol", because the asterisks are robbed of their normal powers. This includes backslash itself: "\\" means a literal backslash. (Don't backslash letters or digits - that turns out to have a meaning all its own, but anyway, there is never any need.)

2. Alternatives. The vertical stroke "|" - not a letter I or L, nor the digit 1 - divides alternatives. Thus

"the fish|fowl|crawling thing"

is the same as saying match "the fish", or "fowl", or "crawling thing".

3. Dividing with brackets. Round brackets "(" and ")" group parts of the expression together.

"the (fish|fowl|crawling thing) in question"

is the same as saying match "the fish in question", or "the fowl in question", or "the crawling thing in question". Note that the "|" ranges outwards only as far as the group it is in.

4. Any character. The period "." means any single character. So

"a...z"

matches on any sequence of five characters so long as the first is "a" and the last is "z".

5. Character alternatives. The angle brackets "<" and ">" are a more concise way of specifying alternatives for a single character. Thus

"b<aeiou>b"

matches on "bab", "beb", "bib", "bob" or "bub", but not "baob" or "beeb" - any single character within the angle brackets is accepted. Beginning the range with "^" means "any single character so long as it is not one of these": thus

"b<^aeiou>b"

matches on "blb" but not "bab", "beb", etc., nor on "blob" or "bb". Because long runs like this can be a little tiresome, we are also allowed to use "-" to indicate whole ranges. Thus

"b<a-z>b"

matches a "b", then any lower case English letter, then another "b".

In traditional regular expression language, square brackets rather than angle brackets are used for character ranges. In fact Inform does understand this notation if there are actual square brackets "[" and "]" in the pattern text, but in practice this would be tiresome to achieve, since Inform uses those to achieve text substitutions. So Inform allows "b<a-z>b" rather than making us type something like

"b[bracket]a-z[close bracket]b"

to create the text "b[a-z]b".

6. Popular character ranges. The range "<0-9>", matching any decimal digit, is needed so often that it has an abbreviation: "\d". Thus

"\d\d\d\d-\d\d-\d\d"

matches, say, "2006-12-03". Similarly, "\s" means "any spacing character" - a space, tab or line break. "\p" is a punctuation character, in the same sense used for word division in the previous section: it actually matches any of

. , ! ? - / " : ; ( ) [ ] { }

"\w" means "any character appearing in a word", and Inform defines it as anything not matching "\s" or "\p".

"\l" and "\u" match lower and upper case letters, respectively. These are much stronger than "<a-z>" and "<A-Z>", since they use the complete definition in the Unicode 4.0.0 standard, so that letter-forms from all languages are catered for: for example "δ" matches "\l" and "Δ" matches "\u".

The reverse of these is achieved by capitalising the letter. So "\D" means "anything not a digit", "\P" means "anything not punctuation", "\W" means "anything not a word character", "\L" means "anything not a lower case letter" and so on.

7. Positional restrictions. The notation "^" does not match anything, as such, but instead requires that we be positioned at the start of the text. Thus

"^fish"

matches only "fish" at the start of the text, not occurring anywhere later on. Similarly, "$" requires that the position be the end of the text. So

"fish$"

matches only if the last four characters are "fish". Matching "^fish$" is the same thing as what Inform calls exactly matching "fish".

Another useful notation is "\b", which matches a word boundary: that is, it matches no actual text, but requires the position to be a junction between a word character and a non-word character (a "\w" and a "\W") or vice versa. Thus

"\bfish\b"

matches "fish" in "some fish" and also "some fish, please!", but not in "shellfish". (The regular expression "\w*fish\b" catches all words ending in "fish", as we will see below.) As usual, the capitalised version "\B" negates this, and means "not at a word boundary".

8. Line break and tab. The notations "\n" and "\t" are used for a line break ("n" for "new line") and tab, respectively. Tabs normally do not occur in Inform strings, but can do when reading from files. It makes no sense to reverse these, so "\N" and "\T" produce errors.

9. Repetition. Placing a number in braces "{" and "}" after something says that it should be repeated that many times. Thus

"ax{25}"

matches only on "axxxxxxxxxxxxxxxxxxxxxxxxx". More usefully, perhaps, we can specify a range of the number of repetitions:

"ax{2,6}"

matches only on "axx", "axxx", "axxxx", "axxxxx", "axxxxxx". And we can leave the top end open: "ax{2,}" means "a" followed by at least two "x"s.

Note that the braces attach only to most recent thing - so "ax{2}" means "a" followed by two of "x" - but, as always, we can use grouping brackets to change that. So "(ax){2,}" matches "axax", "axaxax", "axaxaxax",...

(It's probably best not to use Inform to try to match the human genome against "<acgt>{3000000000}", but one of the most important practical uses of regular expression matching in science is in treating DNA as a string of nucleotides represented by the letters "a", "c", "g", "t", and looking for patterns.)

10. Popular repetitions. Three cases are so often needed that they have standard short forms:

"{0,1}", which means 0 or 1 repetition of something - in other words, doesn't so much repeat it as make it optional - is written "?". Thus "ax?y" matches only on "ay" or "axy".

"{0,}", which means 0 or more repetitions - in other words, any number at all - is written "*". Thus "ax*y" matches on "ay", "axy", "axxy", "axxxy", ... and the omnivorous ".*" - which means "anything, any number of times" - matches absolutely every text.

"{1,}", which means 1 or more repetitions, is written "+". So "\d+" matches any run of digits, for instance.

11. Greedy vs lazy. Once we allow things to repeat an unknown number of times, we run into an ambiguity. Sure, "\d+" matches the text "16339b". But does it look only as far as the "1", then reason that it now has one or more digits in a row, and stop? Or does it run onward devouring digits until it can do so no longer, so matching the "16339" part? These two strategies are called "lazy" and "greedy" respectively.

Do we care? Well, the strategy used makes no difference to whether there is a match, but it does affect what part of the text is matched, and the number of matches there are. Unless we mark for it, all repetitions are greedy. Usually this is good, but it means that, for instance,

"-.+-"

applied to "-alpha- -beta- -gamma-" will match the whole text, because ".+" picks up all of "alpha- -beta- -gamma". To get around this, we can mark any of the repetition operators as lazy by adding a question mark "?". Thus:

"-.+?-"

applied to "-alpha- -beta- -gamma-" matches three times, producing "-alpha-" then "-beta-" then "-gamma-".

A logical but sometimes confusing consequence is that a doubled question mark "??" means "repeat 0 or 1 times, but prefer 0 matches to 1 if both are possibilities": whereas a single question mark "?", being greedy, means "repeat 0 or 1 times, but prefer 1 match to 0 if both are possibilities".

12. Numbered groups. We have already seen that round brackets are useful to clump together parts of the regular expression - to choose within them, or repeat them. In fact, Inform numbers these from 1 upwards as they are used from left to right, and we can subsequently refer back to their contents with the notation "\1", "\2", ... After a successful match, we can find the results of these subexpressions with:

text matching subexpression (number) ... text

This phrase is only meaningful immediately after a successful match of a regular expression against text, and it produces the text which matched. The number must be from 1 to 9, and must correspond to one of the bracketed groups in the expression just matched. Example: after

if "taramasalata" matches the regular expression "a(r.*l)a(.)":

the "text matching regular expression" is "aramasalat", the "text matching subexpression 1" is "ramasal", and "text matching subexpression 2" is "t".

For instance:

"(\w)\w*\1"

matches any run of two or more word-characters, subject to the restriction that the last one has to be the same as the first - so it matches "xerox" but not "alphabet". When Inform matches this against "xerox", first it matches the initial "x" against the group "(\w)". It then matches "\w*" ("any number of word-characters") against "ero", so that the "*" runs up to 3 repetitions. It then matches "\1" against the final "x", because "\1" requires it to match against whatever last matched in sub-expression 1 - which was an "x".

Numbered groups allow wicked tricks in matching, it's true, but really come into their own when it comes to replacing - as we shall see.

13. Switching case sensitivity on and off. The special notations "(?i)" and "(?-i)" switch sensitivity to upper vs. lower case off and on, mid-expression. Thus "a(?i)bcd(?-i)e" matches "abcde", "aBcDe", etc., but not "Abcde" or "abcdE".

14. Groups with special meanings. This is the last of the special syntaxes: but it's a doozy. A round-bracketed group can be marked to behave in a special way by following the open bracket by a symbol with a special meaning. Groups like this have no number and are not counted as part of \1, \2, and so forth - they are intended not to gather up material but to have some effect of their own.

"(# ...)"

Is a comment, that is, causes the group to do nothing and match against anything.

"(?= ...)"

Is a lookahead: it is a form of positional requirement, like "\b" or "^", but one which requires that the text ahead of us matches whatever is in the brackets. (It doesn't consume that text - only checks to see that it's there.) For instance "\w+(?=;)" matches a word followed by a semicolon, but does not match the semicolon itself.

"(?! ...)"

Is the same but negated: it requires that the text ahead of us does not match the material given. For instance, "a+(?!z)" matches any run of "a"s not followed by a "z".

"(?<= ...)" and "(?<! ...)"

Are the same but looking behind (hence the "<"), not forward. These are restricted to cases where Inform can determine that the material to be matched has a definite known width. For instance, "(?<!shell)fish" matches any "fish" not occurring in "shellfish".

"(> ...)"

Is a possessive, that is, causes the material to be matched and, once matched, never lets go. No matter what subsequently turns out to be convenient, it will never change its match. For instance, "\d+8" matches against "768" because Inform realises that "\d+" cannot be allowed to eat the "8" if there is to be a match, and stops it. But "(>\d+)8" does not match against "768" because now the "\d+", which initially eats "768", is possessive and refuses to give up the "8" once taken.

"(?(1)...)" and "(?(1)...|...)"

Are conditionals. These require us to match the material given if \1 has successfully matched already; in the second version, the material after the "|" must be matched if \1 has not successfully matched yet. And the same for 2, 3, ..., 9, of course.

Finally, conditionals can also use lookaheads or lookbehinds as their conditions. So for instance:

"(?(?=\d)\d\d\d\d|AY-\d\d\d\d)"

means if you start with a digit, match four digits; otherwise match "AY-" followed by four digits. There are easier ways to do this, of course, but the really juicy uses of conditionals are only borderline legible and make poor examples - perhaps this is telling us something.


arrow-up.pngStart of Chapter 20: Advanced Text
arrow-left.pngBack to §20.5. Matching and exactly matching
arrow-right.pngOnward to §20.7. Making new text with text substitutions

*ExampleAlpha
Creating a beta-testing command that matches any line starting with punctuation.

*ExampleAbout Inform's regular expression support
Some footnotes on Inform's regular expressions, and how they compare to those of other programming languages.