Regular Expressions
Regular expressions are a concise and powerful, but cryptic, formalism for identifying patterns of text to match. They can be quite useful for parsing text files that have minor variability in their formats. They play a prominent role in several programming languages, most notably Perl and Python.
Starting with release 4.2, Analytica provides Perl-compatible regular expression processing within several of its built-in text functions, notably FindInText, Split, and TextReplace. Each of these functions takes a pattern, which is interpreted as a regular expression when you also specify the optional parameter re:True. For example:
{To find the position of a seven-letter word:} FindInText("\b\w{7}\b","Now is the time for all good men to come to the aid of their country",re:1) → 62
{Split on any word having a repeated letter:} Split("When in the course of human events, it becomes necessary for ...","[^\w]*\b\w*(\w)\w*\1\w*\b[^\w]*",re:1) → ["When in the course of human", "it", "", "for ..."]
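As a cross-check, the two patterns above can be tried in Python's re module, which accepts broadly Perl-like syntax. This is only a sketch and not Analytica code: the helper split_on is a hypothetical name introduced here, and it is written by hand because Python's own re.split would also return the text captured by (\w).

```python
import re

def split_on(pattern, text):
    """Split text at every regex match, discarding the matched separators."""
    parts, last = [], 0
    for m in re.finditer(pattern, text):
        parts.append(text[last:m.start()])
        last = m.end()
    parts.append(text[last:])
    return parts

sentence = "Now is the time for all good men to come to the aid of their country"
# Python positions are 0-based, so add 1 to compare with FindInText.
position = re.search(r"\b\w{7}\b", sentence).start() + 1
print(position)  # 62

phrase = "When in the course of human events, it becomes necessary for ..."
parts = split_on(r"[^\w]*\b\w*(\w)\w*\1\w*\b[^\w]*", phrase)
print(parts)  # ['When in the course of human', 'it', '', 'for ...']
```

The empty string in the result arises because "becomes" and "necessary" are adjacent matches with nothing left between them.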
Basics of Regular Expressions
Regular expressions consist of basic uninterpreted characters (such as the letters and digits), and several special characters that are interpreted. A simple sequence of non-special characters, like "this", is a simple regular expression that matches when that precise sequence of characters occurs anywhere within the subject text.
The power of regular expressions comes from the special sequences that can be used to specify large classes of matching patterns. For example, the dot character means match any character, so the regular expression "t..s" matches anywhere a "t" is followed by any two characters and then by "s", such as "this", "ttts", "t as", etc. If you want to match only a literal dot in the subject text, precede the dot with a backslash, e.g., "t.\." matches "th." and "ts." but not "ths". In general, to use any of the special characters as a literal, precede it with a backslash.
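The dot and backslash behavior described above can be checked with a short sketch. It uses Python's re module as a stand-in (an assumption of this example, not Analytica syntax); the dot and the backslash escape behave the same way there.

```python
import re

# "t..s": a "t", then any two characters, then an "s".
dot_hits = [bool(re.search(r"t..s", s)) for s in ("this", "ttts", "t as", "ts")]
print(dot_hits)  # [True, True, True, False]

# "t.\.": a "t", then any one character, then a literal dot.
escaped_hits = [bool(re.search(r"t.\.", s)) for s in ("th.", "ts.", "ths")]
print(escaped_hits)  # [True, True, False]
```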
Other special characters are these:
- \ general escape character with several uses
- ^ assert start of string (or line, in multiline mode)
- $ assert end of string (or line, in multiline mode)
- . match any character except newline (by default)
- [ start character class definition
- | start of alternative branch
- ( start subpattern
- ) end subpattern
- ? extends the meaning of (; also 0 or 1 quantifier; also quantifier minimizer
- * 0 or more quantifier
- + 1 or more quantifier; also "possessive quantifier"
- { start min/max quantifier
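A few of these special characters are illustrated in the sketch below. It again uses Python's re module as a stand-in for experimentation (an assumption, not Analytica code), and it leaves out possessive quantifiers, which older Python versions do not accept.

```python
import re

greedy = re.search(r"a+", "aaa bbb").group()        # + takes as many as it can
lazy = re.search(r"a+?", "aaa bbb").group()         # ? after + minimizes the match
optional = re.search(r"colou?r", "color").group()   # ? makes the "u" optional
counted = re.search(r"\d{2,3}", "12345").group()    # {min,max} quantifier
either = re.findall(r"cat|dog", "cat, dog, cow")    # | separates alternatives
at_start = bool(re.search(r"^the", "in the end"))   # ^ anchors to the start
line_starts = re.findall(r"^\w+", "one\ntwo", re.MULTILINE)  # ^ per line

print(greedy, lazy, optional, counted)  # aaa a color 123
print(either, at_start, line_starts)    # ['cat', 'dog'] False ['one', 'two']
```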
Part of a pattern that is in square brackets is called a "character class". In a character class the only metacharacters are:
- \ general escape character
- ^ negate the class, but only if the first character
- - indicates character range
- [ POSIX character class (only if followed by POSIX syntax)
- ] terminates the character class
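The character class rules can be tried out as follows. This is a sketch in Python's re module rather than Analytica (an assumption of this example); POSIX classes such as [[:alpha:]] are omitted because Python's re does not support them.

```python
import re

vowels = re.findall(r"[aeiou]", "regex")             # any one listed character
non_digits = re.findall(r"[^0-9]+", "a1bc22d")       # leading ^ negates the class
hex_run = re.search(r"[0-9a-f]+", "ID=7f3a;").group()  # - gives a range
# A ^ that is not first, and a - at the end, are ordinary characters.
literals = re.findall(r"[a^-]", "a^b-c")

print(vowels)      # ['e', 'e']
print(non_digits)  # ['a', 'bc', 'd']
print(hex_run)     # 7f3a
print(literals)    # ['a', '^', '-']
```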
You can also refer to several non-printing characters using the following sequences:
- \a alarm, that is, the BEL character (hex 07)
- \cx "control-x", where x is any character
- \e escape (hex 1B)
- \f formfeed (hex 0C)
- \n linefeed (hex 0A)
- \r carriage return (hex 0D)
- \t tab (hex 09)
- \ddd character with octal code ddd, or backreference
- \xhh character with hex code hh
- \x{hhh..} character with hex code hhh..
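A brief sketch of some of these sequences in use, written with Python's re module as a stand-in (an assumption, not Analytica code). Only \t, \r, \n and \xhh are shown; \cx, \e and \x{hhh..} are PCRE forms that Python's re does not accept.

```python
import re

# Split a tab-separated record on the tab character.
fields = re.split(r"\t", "a\tb\tc")
# \x41 is the character with hex code 41, that is, "A".
hex_match = re.search(r"\x41", "xAy").group()
# Accept either Windows (CR LF) or Unix (LF) line endings.
lines = re.split(r"\r?\n", "one\r\ntwo\nthree")

print(fields)     # ['a', 'b', 'c']
print(hex_match)  # A
print(lines)      # ['one', 'two', 'three']
```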
The full specification of the supported regular expression patterns is given in the PCRE patterns man page.
Finding Patterns in Text
The FindInText function, with several optional parameters, can be used to find patterns in text.
FindInText(pattern, text, caseInsensitive, re, return, subPattern)
- pattern: the regular expression
- text: the subject text being searched
- caseInsensitive: When set to 1, matches 'a' to 'A', etc. Matches are case-sensitive by default.
- re: Must be non-zero for pattern to be interpreted as a regular expression.
- return: Specifies what information should be returned, as follows:
- 'P' (or 'Position'): The position in the subject text where the matched pattern was found, or zero if not found.
- 'L' (or 'Length'): The length of the match in the subject text.
- 'S' (or 'SubPattern'): The subtext matched by the pattern.
- '#' (or '#SubPatterns'): The number of subpatterns in the regular expression.
- subPattern: Which subpattern to return information on. See below.
FindInText can return four kinds of information. By default it returns the position of the match (or zero if there is no match); alternatively, it can return the length of the match, the text that was actually matched, or the number of subpatterns in the regular expression. For example:
- FindInText("[an]+", "A banana in a cabana", re:1, return:'S') → "anana"
If you want to obtain multiple items of information (such as the position, length and matching text) all at the same time, without repeating the match, pass an array to the return parameter.
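The idea of getting position, length and matched text from a single match can be sketched in Python's re module, where one match object carries all three. This is an illustration only: find_info is a hypothetical helper, and the values it returns when nothing matches (0, 0 and an empty text) are choices made for this sketch, not documented FindInText behavior.

```python
import re

def find_info(pattern, text):
    """Return (position, length, matched text) from one search.

    The position is 1-based to mirror FindInText, and 0 means no match.
    """
    m = re.search(pattern, text)
    if m is None:
        return (0, 0, "")  # assumption of this sketch for the no-match case
    return (m.start() + 1, len(m.group()), m.group())

info = find_info(r"[an]+", "A banana in a cabana")
print(info)  # (4, 5, 'anana')
```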
Subpatterns
You can group subpatterns in a regular expression using parentheses. You can then extract the text matched by a particular subpattern by naming that subpattern in the subPattern parameter. The zeroth subpattern always corresponds to the full pattern, and from there grouped expressions are numbered in depth-first order, that is, in the order of their opening parentheses. To group part of a pattern without retaining its contents as a subpattern, use (?:...).
For example:
Index I := 0..4; FindInText("([\w_]+)\s*:\s*((\d*,){4})(\d*),", "NodeInfo: 1,1,1,1,1,1,0,,0,", re:1, return:'S', subPattern:I) →
I | Result |
---|---|
0 | "NodeInfo: 1,1,1,1,1," |
1 | "NodeInfo" |
2 | "1,1,1,1," |
3 | "1," |
4 | "1" |
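The same numbering can be reproduced with Python's re module, used here as a stand-in for experimentation rather than as Analytica code. The second pattern is a variation introduced for this sketch: it replaces the inner group with (?:...) to show how a non-retained group changes the numbering.

```python
import re

subject = "NodeInfo: 1,1,1,1,1,1,0,,0,"
pattern = r"([\w_]+)\s*:\s*((\d*,){4})(\d*),"

m = re.search(pattern, subject)
# Group 0 is the full match; groups 1..4 follow the opening parentheses.
groups = [m.group(i) for i in range(5)]
print(groups)
# ['NodeInfo: 1,1,1,1,1,', 'NodeInfo', '1,1,1,1,', '1,', '1']

# Number of subpatterns, comparable to return:'#'.
n_groups = re.compile(pattern).groups
print(n_groups)  # 4

# With (?:...) the repeated "digits, comma" group is no longer numbered.
non_capturing = r"([\w_]+)\s*:\s*((?:\d*,){4})(\d*),"
m2 = re.search(non_capturing, subject)
print(re.compile(non_capturing).groups, m2.group(3))  # 3 1
```

Group 3 in the first pattern holds only the last of its four repetitions, which is why it shows "1," rather than all four.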
Credits
Analytica makes use of the Perl Compatible Regular Expression library, written by Philip Hazel (email: ph10 at cam.ac.uk) of the University of Cambridge Computing Service, Cambridge, England. Copyright (c) 1997-2008 University of Cambridge All rights reserved.
The library is included in Analytica under the "BSD" license published with the PCRE release 7.8 distributable.