Regular Expressions

Latest revision as of 22:59, 21 December 2017

Regular expressions are a concise and powerful, if cryptic, formalism for specifying patterns, with wildcards and similar constructs, that match desired text strings. They are very useful for text search and parsing, and they play a prominent role in several widely used programming languages, notably Perl and Python.

Analytica includes powerful regular expression processing, similar to Perl's, in some built-in text functions, notably FindInText, SplitText, TextReplace, and FindObjects. Each of these functions has a parameter specifying the text to find, which it interprets as a regular expression when you set the optional parameter re: True. For example:

 { To find the position of a seven-letter word: }
 FindInText("\b\w{7}\b","Now is the time for all good men to come to the aid of their country", re:  1) → 62

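The pattern "\b\w{7}\b" specifies a word boundary "\b" (a zero-width position, such as the one between a space and a letter), followed by exactly seven "{7}" word characters "\w", followed by another word boundary "\b". The result, 62, is the position at which the word "country" starts.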
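The same pattern can be tried outside Analytica. The sketch below uses Python's re module purely as a stand-in (it is not Analytica syntax); it accepts this pattern unchanged. Python positions are 0-based, so 1 is added to obtain the 1-based position that FindInText reports.

```python
import re

text = "Now is the time for all good men to come to the aid of their country"

# \b is a zero-width word boundary; \w{7} is exactly seven word characters.
match = re.search(r"\b\w{7}\b", text)

word = match.group(0)          # the matched text
position = match.start() + 1   # convert Python's 0-based index to 1-based

print(word, position)          # country 62
```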

 { Split on any word containing a repeated letter: }
 SplitText("When in the course of human events, it becomes necessary for ...", "[^\w]*\b\w*(\w)\w*\1\w*\b[^\w]*", re: 1) →
         ["When in the course of human", "it", "", "for ..."]

Basics of Regular Expressions

A regular expression may contain literal (i.e. uninterpreted) characters, such as the letters and digits, that must match the same character in the found text, and special characters that specify wildcard patterns and special actions. A simple sequence of literal characters, like 'this', is a simple regular expression that matches exactly that sequence of characters wherever it occurs in the subject text.

The power of regular expressions comes from the special characters and codes that specify a class of matching patterns. For example, the dot character means any character, so "t..s" matches any text with a "t", followed by any two characters, followed by "s", such as "this", "thus", "t as", etc. If you want to match a literal dot in the subject text, you should precede the dot with a backslash "\" escape character. So, "t.\." matches "th." and "ts." but not "ths". This goes for all special characters: if you mean one as a literal, precede it with a backslash.
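The dot and the escaped dot behave the same way in most regular expression engines. As a minimal sketch, again using Python's re module as a stand-in for Analytica's matcher (the sample strings are made up for illustration):

```python
import re

# "." matches any single character, so "t..s" is t + any two characters + s.
wild = re.findall(r"t..s", "this thus t as toss")

# "\." matches a literal dot only, so "ths" is rejected.
candidates = ["th.", "ts.", "ths"]
escaped = [s for s in candidates if re.fullmatch(r"t.\.", s)]

print(wild)     # ['this', 'thus', 't as', 'toss']
print(escaped)  # ['th.', 'ts.']
```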

Special characters used in Regular Expressions:

  \        escape character -- makes the next special character a literal, e.g. '\\' matches '\'; or makes the next letter a special code, e.g. '\t' matches the tab character
  ^       start of a string (or line, in multiline mode)
  $       end of string (or line, in multiline mode)
  .        match any character except newline (by default)
  [        start definition of a character class 
  ]         end definition of a character class
  |        start of alternative branch
  (        start a subpattern
  )        end a subpattern
  ?        extends the meaning of (
           also 0 or 1 quantifier
           also quantifier minimizer
  *        0 or more of the previous character or subpattern
  +        1 or more of the previous character or subpattern
           also "possessive quantifier"
  {        start min/max quantifier
  \Q...\E  Treat all characters between \Q and \E as literals

A character class specifies a set of possible characters. It is part of a pattern within square brackets. The only special characters in a character class are:

  \      escape character
  ^      negate the class, but only if it is the first character
  -      indicates character range
  [      POSIX character class (only if followed by POSIX syntax)
  ]      terminates the character class
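A few character classes, sketched with Python's re module as a stand-in for Analytica's matcher. The sample strings are hypothetical; the class syntax shown (range, leading caret, caret elsewhere) is the same as described above.

```python
import re

# "-" between two characters denotes a range: digits and lowercase a-f.
hex_runs = re.findall(r"[0-9a-f]+", "7f3a xyz 09")

# "^" as the first character negates the class: remove every non-digit.
digits_only = re.sub(r"[^0-9]", "", "(555) 123-4567")

# "^" anywhere else in the class is an ordinary character.
literal_caret = re.findall(r"[a^]", "a^b")

print(hex_runs)       # ['7f3a', '09']
print(digits_only)    # 5551234567
print(literal_caret)  # ['a', '^']
```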

You can refer to non-printing characters thus:

  \a        alarm, that is, the BEL character (hex 07)
  \cx       "control-x", where x is any character
  \e        escape (hex 1B)
  \f        formfeed (hex 0C)
  \n        linefeed (hex 0A)
  \r        carriage return (hex 0D)
  \t        tab (hex 09)
  \R        any newline character, equivalent to (?>\r\n|\n|\x0b|\f|\r|\x85)
  \ddd      character with octal code ddd, or backreference
  \xhh      character with hex code hh
  \x{hhh..} character with hex code hhh..

Several character groups have special escape sequences, including:

    \w	     A "word" character -- letter, digit, or underscore "_"
    \W	     Non-"word" character -- any character that is not a letter, digit, or underscore
    \s	     Whitespace character -- space, tab, newline
    \S	     Non-whitespace character
    \d	     Digit character --  0 to 9.
    \D	     Non-digit character -- any character that is not a digit
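These escape sequences can be tried out as follows. This is a sketch in Python's re module, not Analytica; the re.ASCII flag is an assumption made here so that \w, \d and \s are limited to the ASCII sets listed above (Python would otherwise also accept Unicode letters and digits).

```python
import re

sample = "Order_66 shipped:\t7 items\n"

# \w+ : runs of letters, digits, or underscore
words = re.findall(r"\w+", sample, flags=re.ASCII)

# \d+ : runs of digits
numbers = re.findall(r"\d+", sample, flags=re.ASCII)

# \W : each single character that is not a word character
non_word = re.findall(r"\W", sample, flags=re.ASCII)

# \s+ : split on runs of whitespace (space, tab, newline)
fields = re.split(r"\s+", sample.strip())

print(words)    # ['Order_66', 'shipped', '7', 'items']
print(numbers)  # ['66', '7']
print(fields)   # ['Order_66', 'shipped:', '7', 'items']
```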

And several escape characters match particular points within text that correspond to a position but not to an actual character:

  \b     matches at a word boundary
  \B     matches when not at a word boundary
  \A     matches at the start of the subject
  \Z     matches at the end of the subject
          also matches before a newline at the end of the subject
  \z     matches only at the end of the subject
  \G     matches at the first matching position in the subject
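The difference between \b and \B is easiest to see on a small example. The sketch below uses Python's re module as a stand-in. It covers only \b, \B and \A, because Python's handling of \Z, \z and \G differs from PCRE's and would not illustrate the table above faithfully. Positions are Python's 0-based offsets.

```python
import re

text = "cat concat catalog cat"

# \b on both sides: only the stand-alone words "cat" match.
whole_words = [m.start() for m in re.finditer(r"\bcat\b", text)]

# \B before "cat": matches only where "cat" continues a word ("concat").
inside_word = [m.start() for m in re.finditer(r"\Bcat", text)]

# \A anchors the match to the start of the subject.
at_start = re.search(r"\Acat", text) is not None

print(whole_words)  # [0, 19]
print(inside_word)  # [7]
print(at_start)     # True
```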

The full specification of the supported regular expression patterns is given in the PCRE pattern man page (pcrepattern).

Multi-line matching

By default, a regular expression can match across multiple lines of the source text. The caret (^) and dollar ($) patterns match only at the very start and very end of the entire text, not at the start or end of each line within it. The \R pattern can be used to match line breaks written as \r\n, \r, or \n. Line breaks in Analytica are usually \r, but \R is more robust in that it matches all three newline conventions.

You can instruct the matcher to operate in a multi-line mode, in which the text is treated as a sequence of separate lines and the anchors apply to each line. In this mode, caret (^) matches at each line start and dollar ($) matches at each line end. To use this mode, begin the regular expression with (?m).
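The effect of (?m) can be sketched with Python's re module, used here only as a stand-in. One assumption to note: Python recognizes only \n as a line separator, so the sample text uses \n, whereas text inside Analytica usually uses \r.

```python
import re

# Three lines separated by \n (Python's line separator, see the note above).
text = "alpha one\nbeta two\ngamma three"

# Default mode: ^ matches only at the very start of the text.
default_mode = re.findall(r"^\w+", text)

# Multi-line mode: ^ matches at the start of every line ...
line_starts = re.findall(r"(?m)^\w+", text)

# ... and $ matches at the end of every line.
line_ends = re.findall(r"(?m)\w+$", text)

print(default_mode)  # ['alpha']
print(line_starts)   # ['alpha', 'beta', 'gamma']
print(line_ends)     # ['one', 'two', 'three']
```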

In theory (according to the PCRE library documentation), you should be able to control which newline character combinations are recognized as the beginning and end of a line. We haven't seen this work, so it may not actually have an effect. To indicate that any newline character combination should be recognized, start the regular expression with (*ANY), as in: "(*ANY)(?m)^\w\d{5}" (which would match a line within the text beginning with a word character followed by 5 digits). The (*ANY) prefix considers any standard new-line combination (CR, LF, CRLF) to denote a line break. The PCRE documentation describes these prefixes as selecting the newline convention only, which is why the example also includes (?m) to turn on multi-line mode.

Three conventions exist for new lines in text file formats. CR was the standard on classic Mac OS. LF is standard on Unix (and on current macOS). CRLF (two characters) is the standard on Windows. Analytica's functions like ReadTextFile typically convert to just CR. Excel on Windows (and in CSV files) may use CR for new rows and LF for new lines within a single cell. So, depending on where your data is coming from, there are sometimes cases in which you may want to use a multi-line mode, but only with a particular new-line character or combination recognized. The (*ANY) prefix recognizes any of these standard conventions as denoting a newline. (*CR) recognizes only CR, (*LF) recognizes only LF, and (*CRLF) recognizes only the CRLF combination. Note that each of these is a prefix that is recognized only at the very start of the regular expression -- the character combination (*CR) would not appear elsewhere within the regular expression.

Finding Patterns in Text

The FindInText function, with several optional parameters, can be used to find patterns in text.

FindInText(pattern, text, caseInsensitive, re, return, subPattern)

Parameters:

«pattern»: the regular expression
«text»: the subject text being searched
«caseInsensitive»: When set to 1, matches 'a' to 'A', etc. Matches are case-sensitive by default.
«re»: Must be non-zero for pattern to be interpreted as a regular expression.
«return»: Specifies what information should be returned, as follows:
  'P' (or 'Position'): The position in the subject text where the matched pattern was found, or zero if not found.
  'L' (or 'Length'): The length of the match in the subject text.
  'S' (or 'SubPattern'): The subtext matched by the pattern.
  '#' (or '#SubPatterns'): The number of subpatterns in the regular expression.
«subPattern»: Which subpattern to return information on. See below.

When using FindInText, you have four options for what information is returned. By default, it returns the position of the match (or zero if there is no match), but you can instead have it return the length of the match, the actual text that was matched, or the number of subpatterns. For example:

FindInText("[an]+", "A banana in a cabana", re: 1, return: 'S') → "anana"

If you want to obtain multiple items of information (such as the position, length, and matching text) all at the same time, without repeating the match, pass an array to the «return» parameter.
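For comparison, the position, length, and matched text of a single match can be read from one match object in Python's re module. This is only an illustrative analogue of the 'P', 'L' and 'S' return options, not Analytica code; the + 1 converts Python's 0-based offset to a 1-based position.

```python
import re

# One search yields all three pieces of information.
match = re.search(r"[an]+", "A banana in a cabana")

position = match.start() + 1          # analogous to return: 'P' (1-based)
length = match.end() - match.start()  # analogous to return: 'L'
matched = match.group(0)              # analogous to return: 'S'

print(position, length, matched)      # 4 5 anana
```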

Subpatterns

You can group subpatterns in a regular expression using parentheses. You can then extract the text matched by a particular subpattern by specifying that subpattern in the «subPattern» parameter. The zeroth subpattern always corresponds to the full pattern, and from there grouped expressions are numbered in depth-first order, that is, by the position of their opening parenthesis from left to right. You can also group with parentheses without retaining the contents by using (?:...).

For example:

Index I := 0..4;

FindInText("([\w_]+)\s*:\s*((\d*,){4})(\d*),", "NodeInfo: 1,1,1,1,1,1,0,,0,", re: 1, return: 'S', subPattern: I) →

0	"NodeInfo: 1,1,1,1,1,"
1	"NodeInfo"
2	"1,1,1,1,"
3	"1,"
4	"1"

You can see here that subPattern: 4 in this example extracts the 5th number in the comma-separated list.
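The group numbering can be checked outside Analytica. The sketch below applies the same pattern in Python's re module (a stand-in, not Analytica syntax) to an input with no spaces after the commas, and reads the group count from the compiled pattern, which plays the role of the '#' return option.

```python
import re

pattern = re.compile(r"([\w_]+)\s*:\s*((\d*,){4})(\d*),")
match = pattern.search("NodeInfo: 1,1,1,1,1,1,0,,0,")

# Number of capturing groups, known from the pattern alone.
group_count = pattern.groups

# Group 0 is the full match; groups 1..4 follow the opening parentheses.
by_number = {i: match.group(i) for i in range(group_count + 1)}

for number, value in by_number.items():
    print(number, repr(value))
# 0 'NodeInfo: 1,1,1,1,1,'
# 1 'NodeInfo'
# 2 '1,1,1,1,'
# 3 '1,'
# 4 '1'
```

Group 3 sits inside a repeated group, so it holds only the last of its four repetitions.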

To figure out how many subPatterns are present, you can set the «return» parameter to '#'. If return contains only '#' (i.e., it isn't an array with other 'P', 'L' or 'S' elements), it will determine the number of subPatterns in the regular expression without actually executing a matching search. Thus, if you wanted to pass an index to «subPattern», you can figure out how long to make the index before executing the match.

There can be many groupings, and the number and order of groups may change as you debug your regular expression, so numbered subpatterns are not always the best choice. You can instead use named subpatterns. The syntax for naming a group is (?<name>...), (?'name'...), or (?P<name>...). When you have named a subpattern, you can extract its value by passing the textual name to the «subPattern» parameter.

 FindInText("([\w_]+)\s*:\s*((\d*,){4})(?<border>\d*),", 
                "NodeInfo: 1,1,1,1,1,1,0,,0,", re: 1, 
                return: 'S', subPattern: 'border') → "1"
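Named groups work the same way in Python's re module, which is used here only as a stand-in. Of the three naming syntaxes listed above, the sketch uses (?P<name>...) because that is the form Python accepts; the input has no spaces after the commas.

```python
import re

match = re.search(
    r"([\w_]+)\s*:\s*((\d*,){4})(?P<border>\d*),",
    "NodeInfo: 1,1,1,1,1,1,0,,0,",
)

# A named group can be read by name or by its number (here, group 4).
border_by_name = match.group("border")
border_by_number = match.group(4)

print(border_by_name, border_by_number)  # 1 1
```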

Duplicate Subpatterns

Cases frequently arise in which there are two or more alternative syntaxes for a subpattern, requiring two subpatterns within the regular expression to have the same name. Usually these are disjunctive, so at most one of them can match. For example, in a standard Excel-compatible CSV format, a cell with no comma or new-line characters does not need to be quoted, but if the cell contains a comma, quotes must be placed around it. For example, a line of a CSV file might be:

San Jose,"1,006,102",10,Chuck Reed,"SJ,San José,SJC"

This CSV entry has 5 items separated by commas, but two items have internal commas and thus are quoted. Thus, each item matches one of two possible regular expressions, either ([^,]+) or "(.*?)". Notice that the parentheses in the second case do not include the quotes, since we do not wish to include them in the extracted text. To match either, we form a disjunction, but since both branches refer to the same item, we name both branches with the same subpattern name:

(?:"(?<city>.+?)"|(?<city>[^,]+)),\s*(?:"(?<pop>.+?)"|(?<pop>[^,]+))

Because the two subpatterns named city are disjunctive, only one of them will match. So, when you request the subpattern "city", you'll get the one which matched (the second in the example). Similarly, only one "pop" subpattern will match, in this example the first, so you'll get info for the one that actually matched.

You could have multiple matches to a subpattern (either named or numbered), as occurs with the regular expression "b(a)*c" applied to "dbaaacd". There is a limitation here in that you can only get the data for one of the repeated matches, the last one.
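The "last repetition wins" behaviour is easier to see when the repeated characters differ. The sketch below uses Python's re module as a stand-in, with a hypothetical pattern and input chosen for illustration. It also shows one possible workaround, which is an assumption rather than a documented Analytica idiom: capture the whole run with a single group and take it apart in a second step.

```python
import re

# The group (\w) is repeated three times: it matches x, then y, then z.
match = re.search(r"b(\w)*c", "dbxyzcd")

whole = match.group(0)   # full match
last = match.group(1)    # only the final repetition is retained

# Workaround sketch: capture the entire run once, then split it up.
run = re.search(r"b(\w*)c", "dbxyzcd").group(1)
repetitions = list(run)

print(whole, last)       # bxyzc z
print(repetitions)       # ['x', 'y', 'z']
```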

Splitting on a Pattern

You can provide a regular expression as the separator to the SplitText function. This makes it possible to split text into parts in such a way that allows multiple types of separators, variable length separators, or uncertainty about what the separator will be.

For example, to split on any punctuation character:

SplitText(text, "[\.\?,!]", re: 1)

Or to split on any run of whitespace, so that you don't get empty items between consecutive spaces:

SplitText(text, "\s+", re: 1)

Notice that the parameter re: 1 must be specified to cause the separator to be interpreted as a regular expression.
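The two separator patterns above can be compared on a sample sentence. This sketch uses re.split from Python's re module as a stand-in for SplitText; the sample text is made up, and details such as the trailing empty item are Python's behaviour, which may differ from SplitText's.

```python
import re

text = "Wait, what?  No way!"   # note the two spaces after "?"

# Split on any single punctuation character.
by_punct = re.split(r"[\.\?,!]", text)

# Split on runs of whitespace: no empty items between separators.
by_run = re.split(r"\s+", text)

# Split on single whitespace characters: the double space yields "".
by_single = re.split(r"\s", text)

print(by_punct)   # ['Wait', ' what', '  No way', '']
print(by_run)     # ['Wait,', 'what?', 'No', 'way!']
print(by_single)  # ['Wait,', 'what?', '', 'No', 'way!']
```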

Substitutions

The TextReplace function accepts a regular expression as its pattern when the re: 1 parameter is specified.

TextReplace(text, pattern, subst, all, caseInsensitive, re)

Parameters:

«text»: the subject text being matched to
«pattern»: The regular expression
«subst»: the text to be substituted for the subtext that matches pattern
«all»: 0 = replace only the first occurrence (default); 1 = replace every occurrence
«caseInsensitive»: When set to 1, 'A' matches 'a', etc. Matches are case-sensitive by default.
«re»: Must be set to 1 for regular expressions

It is recommended that you use a named-parameter calling syntax for the optional parameters. Here are some examples:

TextReplace("3.141592654", "1|5|9", "0", re: 1) → "3.041592654"

TextReplace("3.141592654", "1|5|9", "0", re: 1, all: 1) → "3.040002604"

TextReplace("3.141592654", "(1|5|9)+", "0", re: 1, all: 1) → "3.0402604"
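The three replacements can be cross-checked with re.sub from Python's re module, used here only as a stand-in for TextReplace. In Python, count=1 corresponds to replacing just the first occurrence, and omitting count corresponds to all: 1.

```python
import re

pi_text = "3.141592654"

# Replace only the first 1, 5 or 9.
first_only = re.sub(r"1|5|9", "0", pi_text, count=1)

# Replace every 1, 5 or 9 individually.
every_digit = re.sub(r"1|5|9", "0", pi_text)

# Replace each run of consecutive 1s, 5s and 9s with a single 0.
each_run = re.sub(r"(1|5|9)+", "0", pi_text)

print(first_only)   # 3.041592654
print(every_digit)  # 3.040002604
print(each_run)     # 3.0402604
```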

SubPattern Substitutions

When regular expressions are used, the «subst» parameter may refer to subPattern groupings that appear in the «pattern» parameter. The matching text for those is substituted accordingly. \0 denotes the full text matched by the full regular expression, \1 is the first subpattern, \2 the second, up to \9.

You can also refer to named subpatterns using <name> in the «subst» parameter. Again, the subtext matching the corresponding named subpattern is substituted. Some examples:

TextReplace("3.141592", "(\d)", "\1\1", re: 1) → "33.141592"

TextReplace("3.141592", "(\d)", "\1\1", re: 1, all: 1) → "33.114411559922"

TextReplace("time", "(.)(.)(.)(.)", "\4\3\2\1", re: 1, all: 1) → "emit"

TextReplace("543,632","(?<x>\d+),(?<y>\d+)", "<y>,<x>", re: 1, all: 1) → "632,543"
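The same substitutions can be reproduced with re.sub from Python's re module, shown here as a stand-in rather than as Analytica syntax. Two differences are worth noting: Python writes a named reference as \g<name> where Analytica's «subst» uses <name>, and Python refers to the full match as \g<0> rather than \0.

```python
import re

# Numbered references: \1 is the text matched by the first group.
doubled_first = re.sub(r"(\d)", r"\1\1", "3.141592", count=1)
doubled_all = re.sub(r"(\d)", r"\1\1", "3.141592")

# Reorder four captured characters.
reversed_text = re.sub(r"(.)(.)(.)(.)", r"\4\3\2\1", "time")

# Named references swap the two numbers around the comma.
swapped = re.sub(r"(?P<x>\d+),(?P<y>\d+)", r"\g<y>,\g<x>", "543,632")

print(doubled_first)  # 33.141592
print(doubled_all)    # 33.114411559922
print(reversed_text)  # emit
print(swapped)        # 632,543
```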

Credits

Analytica makes use of the Perl Compatible Regular Expressions (PCRE) library, written by Philip Hazel (email: ph10 at cam.ac.uk) of the University of Cambridge Computing Service, Cambridge, England. Copyright (c) 1997-2008 University of Cambridge. All rights reserved.

The library is included in Analytica under the "BSD" license published with the PCRE release 7.8 distributable.