Difference between revisions of "Regular Expressions"

 
(14 intermediate revisions by 4 users not shown)
Line 2: Line 2:
 
[[Category:Text Functions]]
 
[[Category:Text Functions]]
  
Regular expressions are a concise and powerful, but cryptic, formalism for identifying patterns of text to match.  They can be quite useful for parsing text files that have minor variability in their formats.  They play a prominent role in several programming languages, most notably Perl and Python.
+
__TOC__
  
Starting with release 4.2, Analytica provides very powerful (Perl-compatible) regular expression processing within several of its built-in [[:Category:Text Functions|text functions]], notably [[FindInText]], [[SplitText]], and [[TextReplace]]. Each of these functions takes a pattern, which is interpreted as a regular expression when you also specify an optional parameter: ''re:True''. For example:
+
Regular expressions are a concise and powerful, if cryptic, formalism for specifying patterns, with wildcards and similar constructs, that match desired text strings. They are very useful for text search and parsing. They play a prominent role in several widely-used programming languages, notably Perl and Python.
  
  {To find the position of a seven-letter word:}
+
Analytica includes powerful regular expression processing, similar to Perl, in some built-in [[:Category:Text Functions|text functions]], notably [[FindInText]], [[SplitText]], [[TextReplace]], and [[FindObjects]].  Each of these functions has a parameter specifying the text to find, which it interprets as a regular expression when you also specify the optional parameter <code>re: True</code>.  For example:
[[FindInText]]("\b\w{7}\b","Now is the time for all good men to come to the aid of their country",re:1) &rarr; 62
 
  
  {Split on any word having two repeated letters,}
+
<pre style="background:white; border:white; margin-left: 1em;">
  [[SplitText]]("When in the course of human events, it becomes necessary for ...","[^\w]*\b\w*(\w)\w*\1\w*\b[^\w]*",re:1)&rarr;
+
{ To find the position of a seven-letter word: }
 +
FindInText("\b\w{7}\b","Now is the time for all good men to come to the aid of their country", re:  1) &rarr; 62
 +
</pre>
 +
 
 +
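The pattern <code>"\b\w{7}\b"</code> specifies a word boundary <code>\b</code> (a position between a word character and a non-word character such as a space, or the start or end of the text), followed by exactly seven (<code>{7}</code>) word characters (<code>\w</code>), followed by another word boundary <code>\b</code>.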
 +
<pre style="background:white; border:white; margin-left: 1em;">
 +
  { Split on any word having two repeated letters, }
 +
  SplitText("When in the course of human events, it becomes necessary for ...", "[^\w]*\b\w*(\w)\w*\1\w*\b[^\w]*", re: 1) &rarr;
 
         ["When in the course of human", "it", "", "for ..."]
 
         ["When in the course of human", "it", "", "for ..."]
 +
</pre>
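As a cross-check outside Analytica, the seven-letter-word search can be sketched with Python's <code>re</code> module, which accepts the same <code>\b</code>, <code>\w</code> and <code>{7}</code> syntax. This is only an illustration under that assumption, not Analytica code. Python positions are 0-based, so 1 is added to compare with the 1-based position reported in the FindInText example.

```python
import re

text = "Now is the time for all good men to come to the aid of their country"

# \b = word boundary, \w{7} = exactly seven word characters
match = re.search(r"\b\w{7}\b", text)

word = match.group(0)          # the matched seven-letter word
position = match.start() + 1   # convert Python's 0-based index to 1-based

print(word, position)
```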
  
= Basics of Regular Expression =
+
== Basics of Regular Expressions ==
  
Regular expressions consist of basic uninterpreted characters (such as the letters and digits), and several special characters that are interpreted.  A simple sequence of non-special characters, like "this", is a simple regular expression that matches when that precise sequence of characters occurs anywhere within the subject text.   
+
A regular expression may contain literal (i.e. uninterpreted) characters, such as the letters and digits, that must match the same character in the found text, and special characters that specify wildcard patterns and special actions.  A simple sequence of literal characters, like 'this', is a simple regular expression that matches exactly that sequence of characters wherever it occurs in the subject text.   
  
The power of regular expressions comes from the special sequences that can be used to specify large classes of matching patterns.  For example, the dot character means ''match any character'', so that the regular expression "t..s" matches anywhere a "t" is followed by any two characters and then by "s", such as "this", "ttts", "t as", etc.  If you want to match only to a dot in the subject text, then you precede the dot with a backslash, e.g., "t.\." matches "th." and "ts." but not "ths".  In general, if you want to use any of the special characters as literals, you precede them with a backslash.
+
The power of regular expressions comes from the special characters and codes that specify a class of matching patterns.  For example, the dot character means ''any character'', so <code>"t..s"</code> matches any text with a "t", followed by any two characters, followed by "s", such as "this", "thus", "t as", etc.  If you want to match a literal dot in the subject text, precede the dot with the backslash "\" escape character. So <code>"t.\."</code> matches "th." and "ts." but not "ths".  This goes for all special characters: if you mean one as a literal, precede it with a backslash.
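The dot and escaped-dot examples can be checked with a short sketch in Python's <code>re</code> module, used here only as a stand-in for Analytica's matcher. Note one assumption: <code>fullmatch</code> tests each candidate as a whole string, whereas the text functions described here find a match anywhere within the subject text.

```python
import re

# "t..s": a "t", any two characters, then an "s"
dot_pattern = re.compile(r"t..s")
dot_results = [bool(dot_pattern.fullmatch(s)) for s in ["this", "thus", "t as", "ts"]]

# "t.\.": a "t", any one character, then a literal dot
literal_dot = re.compile(r"t.\.")
literal_results = [bool(literal_dot.fullmatch(s)) for s in ["th.", "ts.", "ths"]]

print(dot_results)
print(literal_results)
```

The candidate "ts" fails the first pattern because each dot must consume exactly one character, and "ths" fails the second because the escaped dot only matches a literal ".".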
  
Other special characters are these:
+
Special characters used in Regular Expressions:
  
<pre>
+
<pre style="background:white; border:white; margin-left: 1em;">
   \     general escape character with several uses
+
   \       escape character -- specify the next special character as a literal (which would otherwise be interpreted as a special character) --
   ^     assert start of string (or line, in multiline mode)
+
e.g. '\\' matches '\'.  Or specify the next letter as a special code (which would otherwise be interpreted as a literal character) -- e.g. '\t' matches the tab character.
   $     assert end of string (or line, in multiline mode)
+
   ^       start of a string (or line, in multiline mode)
   .     match any character except newline (by default)
+
   $       end of string (or line, in multiline mode)
   [     start character class definition
+
   .       match any character except newline (by default)
   |     start of alternative branch
+
   [       start definition of a character class
   (     start subpattern
+
   ]       end definition of a character class
   )     end subpattern
+
   |       start of alternative branch
   ?     extends the meaning of (
+
   (       start a subpattern
        also 0 or 1 quantifier
+
   )       end a subpattern
        also quantifier minimizer
+
   ?       extends the meaning of (
   *     0 or more quantifier
+
          also 0 or 1 quantifier
   +     1 or more quantifier
+
          also quantifier minimizer
        also "possessive quantifier"
+
   *       0 or more of the previous character or subpattern
   {     start min/max quantifier
+
   +       1 or more of the previous character or subpattern
 +
          also "possessive quantifier"
 +
   {       start min/max quantifier
 +
  \Q...\E  Treat all characters between \Q and \E as literals
 
</pre>
 
</pre>
Part of a pattern that is in square brackets is called a "character class". In
+
 
a character class the only metacharacters are:
+
A '''character class''' specifies a set of possible characters. It is part of a pattern within square brackets. The only special characters in a character class are:
<pre>
+
 
   \      general escape character
+
<pre style="background:white; border:white; margin-left: 1em;">
 +
   \      escape character
 
   ^      negate the class, but only if the first character
 
   ^      negate the class, but only if the first character
 
   -      indicates character range
 
   -      indicates character range
Line 48: Line 59:
 
</pre>
 
</pre>
  
You can refer to several non-printing characters using the following sequences:
+
You can refer to non-printing characters thus:
<pre>
+
 
 +
<pre style="background:white; border:white; margin-left: 1em;">
 
   \a        alarm, that is, the BEL character (hex 07)
 
   \a        alarm, that is, the BEL character (hex 07)
 
   \cx      "control-x", where x is any character
 
   \cx      "control-x", where x is any character
Line 57: Line 69:
 
   \r        carriage return (hex 0D)
 
   \r        carriage return (hex 0D)
 
   \t        tab (hex 09)
 
   \t        tab (hex 09)
 +
  \R        any newline character, equivalent to (?>\r\n|\n|\x0b|\f|\r|\x85)
 
   \ddd      character with octal code ddd, or backreference
 
   \ddd      character with octal code ddd, or backreference
 
   \xhh      character with hex code hh
 
   \xhh      character with hex code hh
Line 63: Line 76:
  
 
Several character groups have special escape sequences, including:
 
Several character groups have special escape sequences, including:
<pre>
+
 
     \w     Match a "word" character (letters plus "_")
+
<pre style="background:white; border:white; margin-left: 1em;">
     \W     Match a non-"word" character
+
     \w     A "word" character -- letter, digit, or underscore "_"
     \s     Match a whitespace character
+
     \W     Non-"word" character -- any character that is not a letter, digit, or underscore
     \S     Match a non-whitespace character
+
     \s     Whitespace character -- space, tab, newline
     \d     Match a digit character
+
     \S     Non-whitespace character
     \D     Match a non-digit character
+
     \d     Digit character --  0 to 9.
 +
     \D     Non-digit character -- any character that is not a digit
 
</pre>
 
</pre>
  
 
And several escape characters match particular points within text that correspond to a position but not to an actual character:
 
And several escape characters match particular points within text that correspond to a position but not to an actual character:
<pre>
+
 
 +
<pre style="background:white; border:white; margin-left: 1em;">
 
   \b    matches at a word boundary
 
   \b    matches at a word boundary
 
   \B    matches when not at a word boundary
 
   \B    matches when not at a word boundary
Line 85: Line 100:
 
The full specification of regular expression patterns supported is described at [http://www.newlisp.org/downloads/pcrepattern.html Pcre Patterns Man Page].
 
The full specification of regular expression patterns supported is described at [http://www.newlisp.org/downloads/pcrepattern.html Pcre Patterns Man Page].
  
= Multi-line matching =
+
== Multi-line matching ==
  
By default, a regular expression can be used to match over multiple lines in the source text. The caret (^) and dollar ($) patterns match only the very first and very last character in the entire text, and don't match to the first character in a particular line. The \r and \n patterns can, of course, be used to match to line breaks (in most cases within Analytica, lines will be terminated with \r, but if you aren't sure which type of line break you are dealing with, you can always use <code>(?:\r\n)|\r|\n</code>).
+
By default, a regular expression can match across multiple lines in the source text. The caret (<code>^</code>) and dollar (<code>$</code>) patterns match only at the very start and very end of the entire text, not at the start or end of each individual line. The <code>\R</code> pattern matches a line break; it covers the three common cases <code>(?:\r\n)|\r|\n</code>. Line breaks in Analytica are usually <code>\r</code>, but <code>\R</code> is more robust in that it matches all three newline conventions.
  
You can instruct the matcher to operate in a multi-line mode, in which the text is treated as if composed of separate lines, where a pattern exists on a single line.  In this mode, caret (^) matches each line start and dollar ($) matches each line end.  To use this mode, begin the regular expression with <code>(?m)</code>.   
+
You can instruct the matcher to operate in a multi-line mode, in which the text is treated as if composed of separate lines, where a pattern exists on a single line.  In this mode, caret (<code>^</code>) matches each line start and dollar (<code>$</code>) matches each line end.  To use this mode, begin the regular expression with <code>(?m)</code>.   
  
In theory (according to the PCRE library documentation), you should be able to control which newline character combinations are recognized as the beginning and end of the line.  We haven't seen this work, so it may not actually have an effect.  To indicate that any newline character combination should be recognized, start the regular expression with <code>(*ANY)</code>, as in: "(*ANY)^\w\d{5}"  (which would match to a line within the text beginning with a letter and 5 digits).  The (*ANY) prefix considers any standard new-line combination (CR, LF, CRLF) to denote a line break.
+
In theory (according to the PCRE library documentation), you should be able to control which newline character combinations are recognized as the beginning and end of a line.  We haven't seen this work, so it may not actually have an effect.  To indicate that any newline character combination should be recognized, start the regular expression with <code>(*ANY)</code>, as in: <code>"(*ANY)^\w\d{5}"</code> (which would match a line within the text beginning with a word character followed by 5 digits).  The <code>(*ANY)</code> prefix considers any standard new-line combination (<code>CR, LF, CRLF</code>) to denote a line break.
  
Three conventions exist for new lines in text file formats.  CR is the standard on the Mac.  LF is standard on Unix.  CRLF (two characters) is the standard in Windows.  Analytica's functions like [[ReadTextFile]] typically convert to just CR.  Excel on Windows (and in CSV files) may use CR for new-rows and LF for new-lines within a single cell.  So, depending on where your data is coming from, there are sometimes cases in which you may want to use a multi-line mode, but only with a particular new-line character or combination recognized.  The (*ANY) prefix recognizes any of these standard conventions as denoting a newline.  (*CR) recognizes only CR, (*LF) recognizes only LF, and (*CRLF) recognizes only the CRLF combination.  Note that each of these is a prefix that puts the matcher into a multi-line mode -- the character combinations (*CR) would not appear within the regular expression.
+
Three conventions exist for new lines in text file formats.  CR is the standard on the Mac.  LF is standard on Unix.  CRLF (two characters) is the standard in Windows.  Analytica's functions like [[ReadTextFile]] typically convert to just CR.  Excel on Windows (and in CSV files) may use CR for new-rows and LF for new-lines within a single cell.  So, depending on where your data is coming from, you may sometimes want to use a multi-line mode in which only a particular new-line character or combination is recognized.  The <code>(*ANY)</code> prefix recognizes any of these standard conventions as denoting a newline.  <code>(*CR)</code> recognizes only CR, <code>(*LF)</code> recognizes only LF, and <code>(*CRLF)</code> recognizes only the CRLF combination.  Note that each of these is a prefix that puts the matcher into a multi-line mode -- the character combination <code>(*CR)</code> is not itself part of the text to be matched.
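The effect of the <code>(?m)</code> prefix can be illustrated with Python's <code>re</code> module, which accepts the same inline flag. This is a sketch on hypothetical sample text, not Analytica code, and it differs in one important way: Python treats only LF (<code>\n</code>) as a line break in multi-line mode and does not support the PCRE-specific prefixes <code>(*ANY)</code>, <code>(*CR)</code>, <code>(*LF)</code> and <code>(*CRLF)</code>.

```python
import re

# Hypothetical sample text with LF line breaks
text = "A12345 first\nnot a code\nB67890 last"

pattern = r"^\w\d{5}"   # a word character followed by five digits, at a line start

default_hits = re.findall(pattern, text)              # ^ matches only at the start of the text
multiline_hits = re.findall("(?m)" + pattern, text)   # ^ matches at the start of every line

print(default_hits)
print(multiline_hits)
```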
  
= Finding Patterns in Text =
+
== Finding Patterns in Text ==
  
 
The [[FindInText]] function, with several optional parameters, can be used to find patterns in text.
 
The [[FindInText]] function, with several optional parameters, can be used to find patterns in text.
  
====[[FindInText]](pattern, text'', caseInsensitive, re, return, subPattern'')====
+
====FindInText(pattern, text'', caseInsensitive, re, return, subPattern'')====
* ''pattern'': the regular expression
+
 
* ''text'': the subject text being searched
+
Parameters:
* ''caseInsensitive'': When set to 1, matches 'a' to 'A', etc.  Matches are case-sensitive by default.
+
;«pattern»: the regular expression
* ''re'': Must be non-zero for pattern to be interpreted as a regular expression.
+
;«text»: the subject text being searched
* ''return'': Specifies what information should be returned, as follows:
+
;«caseInsensitive»: When set to <code>1</code>, matches 'a' to 'A', etc.  Matches are case-sensitive by default.
** 'P' (or 'Position'): The position in the subject ''text'' where the matched pattern was found, or zero if not found.
+
;«re»: Must be non-zero for pattern to be interpreted as a regular expression.
** 'L' (or 'Length'): The length of the match in the subject text.
+
;«return»: Specifies what information should be returned, as follows:
** 'S' (or 'SubPattern'): The subtext matched by the pattern
+
:<code>'P'</code> (or <code>'Position'</code>): The position in the subject ''text'' where the matched pattern was found, or zero if not found.
** '#' (or '#SubPatterns'): The number of subpatterns in the regular expression.
+
:<code>'L'</code> (or <code>'Length'</code>): The length of the match in the subject text.
* ''subPattern'': Which subpattern to return information on.  See below.
+
:<code>'S'</code> (or <code>'SubPattern'</code>): The subtext matched by the pattern
 +
:<code>'#'</code> (or <code>'#SubPatterns'</code>): The number of subpatterns in the regular expression.
 +
;«subPattern»: Which subpattern to return information on.  See below.
  
 
When using [[FindInText]], you have four different options for what information is returned.  By default, the position of the match (or zero if there is no match) is returned, but alternatively you can have it return the length of the match, the actual text that was matched, or the number of subpatterns.  For example:
 
When using [[FindInText]], you have four different options for what information is returned.  By default, the position of the match (or zero if there is no match) is returned, but alternatively you can have it return the length of the match, the actual text that was matched, or the number of subpatterns.  For example:
  
:[[FindInText]]("[an]+", "A banana in a cabana", re:1, return:'S') &rarr; "anana"
+
:<code>FindInText("[an]+", "A banana in a cabana", re: 1, return: 'S') &rarr; "anana"</code>
  
If you want to obtain multiple items of information (such as the position, length and matching text) all at the same time, without repeating the match, pass an array to the ''return'' parameter.
+
If you want to obtain multiple items of information (such as the position, length and matching text) all at the same time, without repeating the match, pass an array to the «return» parameter.
  
== Subpatterns ==
+
=== Subpatterns ===
  
You can group subpatterns in a regular expression using parentheses.  You can then extract the values matches to a particular subpattern by specifying which subpattern you are interested in using the ''subPattern'' parameter.  The zeroth subpattern always corresponds to the full pattern, and from there grouped expressions are numbered in a depth-first order.  You can also specify a group using parentheses whose contents is not to be retained using ''(?:...)''
+
You can group subpatterns in a regular expression using parentheses.  You can then extract the text matched by a particular subpattern by specifying which subpattern you are interested in with the «subPattern» parameter.  The zeroth subpattern always corresponds to the full pattern, and from there grouped expressions are numbered in a depth-first order (that is, by the position of each opening parenthesis).  You can also use parentheses to specify a group whose contents are not retained, using <code>(?:...)</code>
  
 
For example:
 
For example:
  
Index I := 0..4;
+
:<code>Index I := 0..4;</code>
[[FindInText]]("([\w_]+)\s*:\s*((\d*,){4})(\d*),", "NodeInfo: 1,1,1,1,1,1,0,,0,", re:1, return:'S', subPattern:I)
+
:<code>FindInText("([\w_]+)\s*:\s*((\d*,){4})(\d*),", "NodeInfo: 1,1,1,1,1,1,0,,0,", re: 1, return: 'S', subPattern: I) &rarr;</code>
&rarr;
+
:{|class="wikitable"
{| border="1"
+
  ! 0  
  ! 0 || "NodeInfo: 1,1,1,1,1,"
+
| "NodeInfo: 1,1,1,1,1,"
 
  |-
 
  |-
  ! 1 || "NodeInfo"  
+
  ! 1  
 +
| "NodeInfo"  
 
  |-
 
  |-
  ! 2 || "1,1,1,1,"
+
  ! 2  
 +
| "1,1,1,1,"
 
  |-
 
  |-
  ! 3 || "1,"
+
  ! 3  
 +
| "1,"
 
  |-
 
  |-
  ! 4 || "1"
+
  ! 4  
 +
| "1"
 
  |}
 
  |}
  
You can see here that ''subPattern:4'' in this example extracts the 5th number in the comma-separated list.
+
You can see here that <code>subPattern: 4</code> in this example extracts the 5th number in the comma-separated list.
  
To figure out how many subPatterns are present, you can set the ''return'' parameter to '#'.  If ''return'' contains only '#' (i.e., it isn't an array with other 'P', 'L' or 'S' elements), it will determine the number of subPatterns in the regular expression without actually executing a matching search.  Thus, if you wanted to pass an index to ''subPattern'', you can figure out how long to make the index before executing the match.
+
To figure out how many subPatterns are present, you can set the «return» parameter to '#'.  If ''return'' contains only '#' (i.e., it isn't an array with other 'P', 'L' or 'S' elements), it will determine the number of subPatterns in the regular expression without actually executing a matching search.  Thus, if you wanted to pass an index to «subPattern», you can figure out how long to make the index before executing the match.
  
There can be many groupings, and the number and order of groups may change as you debug your regular expression, so using numbered subpatterns is not always the best.  You can instead use named subpatterns.  The syntax for naming a group is: (?<name>...), or (?'name'....) or (?P<name>...).  When you have named a subpattern, you can extract its value by passing the textual name to the ''subPattern'' parameter.
+
There can be many groupings, and the number and order of groups may change as you debug your regular expression, so numbered subpatterns are not always the best choice.  You can instead use named subpatterns.  The syntax for naming a group is <code>(?<name>...)</code>, <code>(?'name'...)</code> or <code>(?P<name>...)</code>.  When you have named a subpattern, you can extract its value by passing the textual name to the «subPattern» parameter.
  
  [[FindInText]]("([\w_]+)\s*:\s*((\d*,){4})(?<border>\d*),",  
+
<pre style="background:white; border:white; margin-left: 1em;">
                 "NodeInfo: 1,1,1,1,1,1,0,,0,", re:1,  
+
  FindInText("([\w_]+)\s*:\s*((\d*,){4})(?<border>\d*),",  
                 return:'S', subPattern:'border') &rarr; "1"
+
                 "NodeInfo: 1,1,1,1,1,1,0,,0,", re: 1,  
 +
                 return: 'S', subPattern: 'border') &rarr; "1"
 +
</pre>
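The numbered and named subpatterns can be traced with a short sketch in Python's <code>re</code> module, used here only as an illustration and not as Analytica code. Python requires the <code>(?P<name>...)</code> spelling, which is one of the three forms listed above. The subject text is the comma-separated "NodeInfo" string without spaces, and the variable names are hypothetical.

```python
import re

# Same pattern as the example, with the named group spelled (?P<border>...)
pattern = r"([\w_]+)\s*:\s*((\d*,){4})(?P<border>\d*),"
subject = "NodeInfo: 1,1,1,1,1,1,0,,0,"

m = re.search(pattern, subject)

numbered = [m.group(i) for i in range(5)]   # subpatterns 0..4, 0 being the full match
border = m.group("border")                  # the named group is also group number 4
group_count = re.compile(pattern).groups    # counted from the pattern alone, no match needed

print(numbered, border, group_count)
```

Python's group count excludes the zeroth (full-match) group. Because group 3 sits inside a repetition, only its last repetition is retained.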
  
== Duplicate Subpatterns ==
+
=== Duplicate Subpatterns ===
  
 
Cases frequently arise in which there are two or more alternative syntaxes for a subpattern, so that two subpatterns within the regular expression need the same name. These alternatives are usually disjunctive.  For example, in a standard Excel-compatible CSV format, a cell with no comma or new-line characters does not need to be quoted, but if the cell contains a comma, quotes must be placed around it.  For example, a line of a CSV file might be:
 
Cases frequently arise in which there are two or more alternative syntaxes for a subpattern, so that two subpatterns within the regular expression need the same name. These alternatives are usually disjunctive.  For example, in a standard Excel-compatible CSV format, a cell with no comma or new-line characters does not need to be quoted, but if the cell contains a comma, quotes must be placed around it.  For example, a line of a CSV file might be:
  
:San Jose,"1,006,102",10,Chuck Reed,"SJ,San José,SJC"
+
:<code>San Jose,"1,006,102",10,Chuck Reed,"SJ,San José,SJC"</code>
  
 
This CSV entry has 5 items separated by commas, but two items have internal commas and thus are quoted.  Thus, each item matches one of two possible regular expressions, either: <code>([^,]+)</code> or <code>"(.*?)"</code>.  Notice that the parentheses in the second case do not include the quotes, since we do not wish to include them in the subpattern.  To match either, we form a disjunction, but since both branches refer to the same item, we name both with the same subpattern name:
 
This CSV entry has 5 items separated by commas, but two items have internal commas and thus are quoted.  Thus, each item matches one of two possible regular expressions, either: <code>([^,]+)</code> or <code>"(.*?)"</code>.  Notice that the parentheses in the second case do not include the quotes, since we do not wish to include them in the subpattern.  To match either, we form a disjunction, but since both branches refer to the same item, we name both with the same subpattern name:
  
("(?<city>.+?)")|(?<city>[^,]+?),\s*("(?<pop>.+?)")|(?<pop>[^,]+?)
+
:<code>("(?<city>.+?)")|(?<city>[^,]+?),\s*("(?<pop>.+?)")|(?<pop>[^,]+?)</code>
  
 
Because the two subpatterns named ''city'' are disjunctive, only one of them will match.  So, when you request the subpattern "city", you'll get the one which matched (the second in the example).  Similarly, only one "pop" subpattern will match, in this example the first, so you'll get info for the one that actually matched.
 
Because the two subpatterns named ''city'' are disjunctive, only one of them will match.  So, when you request the subpattern "city", you'll get the one which matched (the second in the example).  Similarly, only one "pop" subpattern will match, in this example the first, so you'll get info for the one that actually matched.
Line 162: Line 185:
 
You could have multiple matches to a subpattern (either named or numbered), as occurs with the regular expression "b(a)*c" applied to "dbaaacd".  There is a limitation here in that you can only get the data for one of the repeated matches, the last one.
 
You could have multiple matches to a subpattern (either named or numbered), as occurs with the regular expression "b(a)*c" applied to "dbaaacd".  There is a limitation here in that you can only get the data for one of the repeated matches, the last one.
  
= Splitting on a Pattern =
+
== Splitting on a Pattern ==
  
 
You can provide a regular expression as the separator to the [[SplitText]] function.  This makes it possible to split text into parts in such a way that allows multiple types of separators, variable length separators, or uncertainty about what the separator will be.
 
You can provide a regular expression as the separator to the [[SplitText]] function.  This makes it possible to split text into parts in such a way that allows multiple types of separators, variable length separators, or uncertainty about what the separator will be.
  
 
For example, to split on any punctuation character:
 
For example, to split on any punctuation character:
  [[SplitText]]( text, "[\.\?,!]", re:1 )
+
:<code>SplitText(text, "[\.\?,!]", re: 1)</code>
  
 
Or to split on any run of whitespace, so that consecutive spaces don't produce empty items:
 
Or to split on any run of whitespace, so that consecutive spaces don't produce empty items:
  
  [[SplitText]](text, "\s+", re:1 )
+
:<code>SplitText(text, "\s+", re: 1)</code>
  
Notice that the parameter ''re:1'' must be specified to cause the separator to be interpreted as a regular expression.
+
Notice that the parameter <code>re: 1</code> must be specified to cause the separator to be interpreted as a regular expression.
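The two separators above can be tried out with Python's <code>re.split</code>, which serves here as a rough analogue of SplitText with a regular-expression separator. The sample strings are hypothetical, and whether SplitText also returns a trailing empty item is not something this sketch establishes.

```python
import re

# Hypothetical sample text; split on any of the punctuation characters . ? , !
sentence = "Wait, what? No! Yes."
by_punctuation = re.split(r"[\.\?,!]", sentence)

# Split on runs of whitespace versus on each single space
spaced = "one  two   three four"
by_whitespace = re.split(r"\s+", spaced)
by_single_space = spaced.split(" ")   # plain, non-regex split for comparison

print(by_punctuation)
print(by_whitespace)
print(by_single_space)
```

Splitting on a single space leaves an empty item between consecutive spaces, which is what the <code>\s+</code> separator avoids.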
  
= Substitutions =
+
== Substitutions ==
  
The [[TextReplace]] function accepts a regular expression as its pattern when the ''re:1'' parameter is specified.
+
The [[TextReplace]] function accepts a regular expression as its pattern when the <code>re: 1</code> parameter is specified.
  
====[[TextReplace]](text,pattern,subst'',all,caseInsensitive,re)====
+
===TextReplace(text, pattern, subst'', all, caseInsensitive, re'')===
  
* ''text'' : the subject text being matched to
+
Parameters:
* ''pattern'': The regular expression  
+
;«text»: the subject text being matched to
* ''subst'' : the text to be substituted for the subtext that matches pattern
+
;«pattern»: The regular expression  
* ''all'' : 0=replace only the first occurrence (default), 1=replace every occurrence
+
;«subst»: the text to be substituted for the subtext that matches pattern
* ''caseInsensitive'': 1='A' matches 'a', etc.  CaseSensitive by default.
+
;«all»:
* ''re'': Must be set to 1 for regular expressions
+
:<code>0</code> = replace only the first occurrence (default)
 +
:<code>1</code> = replace every occurrence
 +
;«caseInsensitive»: <code>1</code> = 'A' matches 'a', etc.  Case-sensitive by default.
 +
;«re»: Must be set to <code>1</code> for regular expressions
  
 
It is recommended that you use a named-parameter calling syntax for the optional parameters.  Here are some examples:
 
It is recommended that you use a named-parameter calling syntax for the optional parameters.  Here are some examples:
  
:TextReplace("3.141592654", "1|5|9", "0", re:1 ) &rarr; "3.041592654"
+
:<code>TextReplace("3.141592654", "1|5|9", "0", re: 1 ) &rarr; "3.041592654"</code>
:TextReplace("3.141592654", "1|5|9", "0", re:1, all:1 ) &rarr; "3.040002604"
+
:<code>TextReplace("3.141592654", "1|5|9", "0", re: 1, all: 1) &rarr; "3.040002604"</code>
:TextReplace("3.141592654", "(1|5|9)+", "0", re:1, all:1 ) &rarr; "3.04020604"
+
:<code>TextReplace("3.141592654", "(1|5|9)+", "0", re: 1, all: 1) &rarr; "3.04020604"</code>
  
== SubPattern Substitutions ==
+
=== SubPattern Substitutions ===
  
When regular expressions are used, the ''subst'' parameter may refer to subPattern groupings that appear in the ''pattern'' parameter.  The matching text for those is substituted accordingly.  ''\0'' denotes the full text matched by the full regular expression, ''\1'' is the first subpattern, ''\2'' the second, up to ''\9''.   
+
When regular expressions are used, the «subst» parameter may refer to subPattern groupings that appear in the «pattern» parameter.  The matching text for those is substituted accordingly.  ''\0'' denotes the full text matched by the full regular expression, ''\1'' is the first subpattern, ''\2'' the second, up to ''\9''.   
  
You can also refer to named subpatterns using ''<name>'' in the ''subst'' parameter.  Again, the subtext matching the corresponding named subpattern is substituted.  Some examples:
+
You can also refer to named subpatterns using ''<name>'' in the «subst» parameter.  Again, the subtext matching the corresponding named subpattern is substituted.  Some examples:
  
:TextReplace("3.141592", "(\d)", "\1\1", re:1 ) &rarr; "33.141592"
+
:<code>TextReplace("3.141592", "(\d)", "\1\1", re: 1) &rarr; "33.141592"</code>
:TextReplace("3.141592", "(\d)", "\1\1", re:1, all:1 ) &rarr; "33.114411559922"
+
:<code>TextReplace("3.141592", "(\d)", "\1\1", re: 1, all: 1) &rarr; "33.114411559922"</code>
:TextReplace("time", "(.)(.)(.)(.)", "\4\3\2\1", re:1, all:1 ) &rarr; "emit"
+
:<code>TextReplace("time", "(.)(.)(.)(.)", "\4\3\2\1", re: 1, all: 1) &rarr; "emit"</code>
:TextReplace("543,632","(?<x>\d+),(?<y>\d+)", "<y>,<x>", re:1, all:1) &rarr; "632,543"
+
:<code>TextReplace("543,632","(?<x>\d+),(?<y>\d+)", "<y>,<x>", re: 1, all: 1) &rarr; "632,543"</code>
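The subpattern substitutions can be reproduced with Python's <code>re.sub</code> as a sanity check. This is an illustrative sketch rather than Analytica code, and the conventions differ in two ways: <code>re.sub</code> replaces every occurrence unless <code>count</code> is given (the opposite of the default for the «all» parameter), and a named subpattern is referenced as <code>\g<name></code> instead of <code><name></code>.

```python
import re

# Duplicate each digit: \1 refers to the first subpattern
doubled_first = re.sub(r"(\d)", r"\1\1", "3.141592", count=1)   # first match only
doubled_all = re.sub(r"(\d)", r"\1\1", "3.141592")              # every match

# Reverse a four-letter word by reordering four subpatterns
reversed_word = re.sub(r"(.)(.)(.)(.)", r"\4\3\2\1", "time")

# Swap two numbers using named subpatterns x and y
swapped = re.sub(r"(?P<x>\d+),(?P<y>\d+)", r"\g<y>,\g<x>", "543,632")

print(doubled_first, doubled_all, reversed_word, swapped)
```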
 
 
= Credits =
 
  
 +
== Credits ==
 
Analytica makes use of the ''Perl Compatible Regular Expression'' library, written by Philip Hazel (email: ph10 at cam.ac.uk) of the University of Cambridge Computing Service, Cambridge, England.
 
Analytica makes use of the ''Perl Compatible Regular Expression'' library, written by Philip Hazel (email: ph10 at cam.ac.uk) of the University of Cambridge Computing Service, Cambridge, England.
 
Copyright (c) 1997-2008 University of Cambridge
 
Copyright (c) 1997-2008 University of Cambridge
Line 212: Line 237:
  
 
The library is included in Analytica under the "BSD" license published with the PCRE release 7.8 distributable.
 
The library is included in Analytica under the "BSD" license published with the PCRE release 7.8 distributable.
 +
 +
== See Also ==
 +
* [http://www.newlisp.org/downloads/pcrepattern.html PCRE Pattern Man Page] -- for in-depth info on the full set of patterns available.
 +
* [[FindInText]]
 +
* [[SplitText]]
 +
* [[TextReplace]]

Latest revision as of 22:59, 21 December 2017


Regular expressions are a concise, powerful, if cryptic, formalism to specify a pattern with wild cards and so on to match desired text strings. They are very useful for text search and parsing. They play a prominent role in several widely-used programming languages, notably Perl and Python.

Analytica includes powerful regular expression processing, similar to Perl, in some built-in text functions, notably FindInText, SplitText, TextReplace, and FindObjects. Each of these functions has a parameter specifying the text to find, which it interprets as a regular expression when you also specify the optional parameter re: True. For example:

 { To find the position of a seven-letter word: }
 FindInText("\b\w{7}\b","Now is the time for all good men to come to the aid of their country", re:  1) → 62

The pattern "\b\w{7}\b" specifies a word boundary "\b" (a zero-width position, such as the point between a space and a letter), followed by exactly seven "{7}" word characters "\w", ending at another word boundary "\b".
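The same pattern can be tried outside Analytica. The following is a minimal sketch in Python, whose `re` module treats this particular pattern the same way; it is not Analytica code, the variable names are illustrative, and 1 is added to Python's zero-based offset to obtain the one-based position that FindInText reports.

```python
import re

text = "Now is the time for all good men to come to the aid of their country"

# \b = word boundary, \w{7} = exactly seven word characters
match = re.search(r"\b\w{7}\b", text)

# Python offsets are zero-based; add 1 for a one-based position
position = match.start() + 1
word = match.group(0)
print(position, word)  # 62 country
```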

 { Split on any word containing a repeated letter: }
 SplitText("When in the course of human events, it becomes necessary for ...", "[^\w]*\b\w*(\w)\w*\1\w*\b[^\w]*", re: 1) →
         ["When in the course of human", "it", "", "for ..."]

Basics of Regular Expressions

A regular expression may contain literal (i.e. uninterpreted) characters, such as the letters and digits, that must match the same character in the found text, and special characters that specify wildcard patterns and special actions. A simple sequence of literal characters, like 'this', is a simple regular expression that matches exactly that sequence of characters wherever it occurs in the subject text.

The power of regular expressions comes from the special characters and codes that specify a class of matching patterns. For example, the dot character means any character, so "t..s" matches any text with a "t", followed by any two characters, followed by "s", such as "this", "thus", "t as", etc. If you want to match a literal dot in the subject text, precede the dot with the backslash "\" escape character. So, "t.\." matches "th." and "ts." but not "ths". This goes for all special characters: if you mean one as a literal, precede it with a backslash.
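As a quick check of the dot and the escaped dot, here is a small sketch using Python's `re` module as a stand-in for Analytica's matcher. The word lists are made up for illustration, and `re.fullmatch` is used so that each candidate is tested as a whole rather than searched for a match somewhere inside it.

```python
import re

# "." matches any single character, so "t..s" accepts any two characters in the middle
candidates = ["this", "thus", "t as", "toss", "ts"]
dot_matches = [w for w in candidates if re.fullmatch(r"t..s", w)]

# "\." matches only a literal dot
escaped_matches = [w for w in ["th.", "ts.", "ths"] if re.fullmatch(r"t.\.", w)]

print(dot_matches)      # ['this', 'thus', 't as', 'toss']
print(escaped_matches)  # ['th.', 'ts.']
```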

Special characters used in Regular Expressions:

  \        escape character -- treats the next special character as a literal, e.g. '\\' matches '\';
           or gives the next letter a special meaning, e.g. '\t' matches the tab character
  ^        start of string (or line, in multiline mode)
  $        end of string (or line, in multiline mode)
  .        match any character except newline (by default)
  [        start a character class definition
  ]        end a character class definition
  |        start of alternative branch
  (        start a subpattern
  )        end a subpattern
  ?        extends the meaning of (
           also 0 or 1 quantifier
           also quantifier minimizer
  *        0 or more of the previous character or subpattern
  +        1 or more of the previous character or subpattern
           also "possessive quantifier"
  {        start min/max quantifier
  \Q...\E  treat all characters between \Q and \E as literals
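Several of the quantifier and alternation characters listed above can be illustrated with a short sketch. It uses Python's `re` module rather than Analytica, and the sample strings are invented for the purpose; `\Q...\E` is left out because Python's `re` does not support it.

```python
import re

subject = "<b>bold</b> and <i>italic</i>"

# "+" is greedy by default: it runs to the last ">" in the subject
greedy = re.search(r"<.+>", subject).group(0)

# "?" after a quantifier minimizes it: stop at the first ">"
lazy = re.search(r"<.+?>", subject).group(0)

# "?" on its own means 0 or 1 of the previous character
optional = re.findall(r"colou?r", "color colour colouur")

# "|" separates alternative branches
alternatives = re.findall(r"cat|dog", "cat, dog, cow")

# "{2,3}" is a min/max quantifier
bounded = re.findall(r"\b\d{2,3}\b", "7 42 365 2024")

print(greedy, lazy, optional, alternatives, bounded)
```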

A character class specifies a set of possible characters. It is the part of a pattern enclosed in square brackets. The only special characters in a character class are:

  \      escape character
  ^      negate the class, but only if it is the first character
  -      indicates a character range
  [      POSIX character class (only if followed by POSIX syntax)
  ]      terminates the character class
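A few character classes in action, sketched with Python's `re` module as a stand-in for Analytica. The sample strings are illustrative only, and POSIX classes such as [[:alpha:]] are omitted because Python's `re` does not provide them.

```python
import re

# A plain set of characters
vowels = re.findall(r"[aeiou]", "regular")

# "-" gives a range
first_three = re.findall(r"[a-c]", "abcdef-")

# "^" as the first character negates the class
fields = re.findall(r"[^,]+", "red,green,blue")

# "\" escapes characters that would otherwise be special inside a class
symbols = re.findall(r"[\^\-]", "a^b-c")

print(vowels, first_three, fields, symbols)
```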

You can refer to non-printing characters thus:

  \a        alarm, that is, the BEL character (hex 07)
  \cx       "control-x", where x is any character
  \e        escape (hex 1B)
  \f        formfeed (hex 0C)
  \n        linefeed (hex 0A)
  \r        carriage return (hex 0D)
  \t        tab (hex 09)
  \R        any newline character, equivalent to (?>\r\n|\n|\x0b|\f|\r|\x85)
  \ddd      character with octal code ddd, or backreference
  \xhh      character with hex code hh
  \x{hhh..} character with hex code hhh..

Several character groups have special escape sequences, including:

  \w     A "word" character -- letter, digit, or underscore "_"
  \W     Non-"word" character -- any character that is not a letter, digit, or underscore
  \s     Whitespace character -- space, tab, newline
  \S     Non-whitespace character
  \d     Digit character -- 0 to 9
  \D     Non-digit character -- any character that is not a digit
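The sketch below shows these escape sequences on an invented sample string, again using Python's `re` module as an approximation of Analytica's behavior. For plain ASCII text like this, Python's `\w`, `\d`, and `\s` agree with the descriptions above.

```python
import re

sample = "Order_42 shipped\ton 2017-12-21"

words = re.findall(r"\w+", sample)        # runs of letters, digits, underscores
digits = re.findall(r"\d+", sample)       # runs of digits
only_digits = re.sub(r"\D", "", sample)   # delete every non-digit
fields = re.split(r"\s+", sample)         # split on runs of whitespace

print(words, digits, only_digits, fields)
```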

Several escape sequences match a position within the text rather than an actual character:

  \b     matches at a word boundary
  \B     matches when not at a word boundary
  \A     matches at the start of the subject
  \Z     matches at the end of the subject
          also matches before a newline at the end of the subject
  \z     matches only at the end of the subject
  \G     matches at the first matching position in the subject
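To see the difference between `\b`, `\B`, and `\A`, consider this sketch in Python's `re` module (a stand-in for Analytica, with a made-up sample string). `\Z` is deliberately not used: in Python it matches only at the very end of the string, which corresponds to `\z` in the table above, not to PCRE's `\Z`.

```python
import re

sample = "cat concat cats cat"

# \bcat\b: "cat" as a whole word only
whole_words = [m.start() for m in re.finditer(r"\bcat\b", sample)]

# \Bcat: "cat" that does not start at a word boundary (inside "concat")
inside_word = [m.start() for m in re.finditer(r"\Bcat", sample)]

# \A: anchored at the start of the subject
starts_with_cat = re.search(r"\Acat", sample) is not None

print(whole_words, inside_word, starts_with_cat)  # zero-based offsets
```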

The full specification of the regular expression patterns supported is given in the PCRE Pattern Man Page.

Multi-line matching

By default, a regular expression can match across multiple lines of the subject text. The caret (^) and dollar ($) patterns match only at the very start and very end of the entire text, not at the start or end of each line. The \R pattern matches line breaks, including the three common newline conventions \r\n, \r, and \n. Line breaks in Analytica are usually \r, but \R is more robust because it matches all three conventions.

You can instruct the matcher to operate in multi-line mode, which treats the text as a sequence of separate lines. In this mode, caret (^) matches at the start of each line and dollar ($) matches at the end of each line. To use this mode, begin the regular expression with (?m).
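The effect of the (?m) prefix can be sketched in Python's `re` module, which accepts the same prefix. Two assumptions are worth stating: the sample text is invented, and it uses \n line breaks because Python's multi-line mode recognizes only \n, whereas text in Analytica usually uses \r.

```python
import re

text = "alpha 1\nbeta 2\ngamma 3"

# Default mode: ^ matches only at the start of the whole text
default_mode = re.findall(r"^\w+", text)

# Multi-line mode: ^ matches at the start of every line
multi_line = re.findall(r"(?m)^\w+", text)

# Multi-line mode: $ matches at the end of every line
line_ends = re.findall(r"(?m)\d$", text)

print(default_mode, multi_line, line_ends)
```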

In theory (according to the PCRE library documentation), you should be able to control which newline character combinations are recognized as the beginning and end of a line. We haven't seen this work, so it may not actually have an effect. To indicate that any newline character combination should be recognized, start the regular expression with (*ANY), as in "(*ANY)^\w\d{5}" (which would match a line within the text beginning with a word character followed by 5 digits). The (*ANY) prefix considers any standard new-line combination (CR, LF, CRLF) to denote a line break.

Three conventions exist for new lines in text file formats. CR was the standard on the classic Mac OS, LF is standard on Unix, and CRLF (two characters) is the standard on Windows. Analytica's functions like ReadTextFile typically convert to just CR. Excel on Windows (and in CSV files) may use CR for new rows and LF for new lines within a single cell. So, depending on where your data comes from, you may sometimes want to use multi-line mode with only a particular new-line character or combination recognized. The (*ANY) prefix recognizes any of these standard conventions as denoting a newline. (*CR) recognizes only CR, (*LF) recognizes only LF, and (*CRLF) recognizes only the CRLF combination. Note that each of these is a prefix that puts the matcher into a multi-line mode; it appears only at the start of the regular expression, not within it.

Finding Patterns in Text

The FindInText function, with several optional parameters, can be used to find patterns in text.

FindInText(pattern, text, caseInsensitive, re, return, subPattern)

Parameters:

«pattern»
the regular expression
«text»
the subject text being searched
«caseInsensitive»
When set to 1, matches 'a' to 'A', etc. Matches are case-sensitive by default.
«re»
Must be non-zero for pattern to be interpreted as a regular expression.
«return»
Specifies what information should be returned, as follows:
'P' (or 'Position'): The position in the subject text where the matched pattern was found, or zero if not found.
'L' (or 'Length'): The length of the match in the subject text.
'S' (or 'SubPattern'): The subtext matched by the pattern
'#' (or '#SubPatterns'): The number of subpatterns in the regular expression.
«subPattern»
Which subpattern to return information on. See below.

When using FindInText, you have four options for what information is returned. By default, the position of the match (or zero if there is no match) is returned. Alternatively, you can have it return the length of the match, the actual text that was matched, or the number of subpatterns in the regular expression. For example:

FindInText("[an]+", "A banana in a cabana", re: 1, return: 'S') → "anana"

If you want to obtain multiple items of information (such as the position, length, and matching text) all at the same time, without repeating the match, pass an array to the «return» parameter.
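For comparison, a single match object in Python's `re` module carries the same three pieces of information. This is only an analogy to the 'P', 'L', and 'S' return options, not a description of how Analytica works internally; the variable names are illustrative and the position is converted to one-based.

```python
import re

match = re.search(r"[an]+", "A banana in a cabana")

position = match.start() + 1    # one-based, like 'P'
length = len(match.group(0))    # like 'L'
matched_text = match.group(0)   # like 'S'

print(position, length, matched_text)  # 4 5 anana
```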

Subpatterns

You can group subpatterns in a regular expression using parentheses. You can then extract the text matched by a particular subpattern by specifying it in the «subPattern» parameter. The zeroth subpattern always corresponds to the full pattern, and from there grouped expressions are numbered in depth-first order, that is, in the order of their opening parentheses. To group with parentheses without retaining the contents as a subpattern, use (?:...).

For example:

Index I := 0..4;
FindInText("([\w_]+)\s*:\s*((\d*,){4})(\d*),", "NodeInfo: 1,1,1,1,1,1,0,,0,", re: 1, return: 'S', subPattern: I) →
0 "NodeInfo: 1,1,1,1,1,"
1 "NodeInfo"
2 "1,1,1,1,"
3 "1,"
4 "1"

You can see here that subPattern: 4 in this example extracts the 5th number in the comma-separated list.

To figure out how many subPatterns are present, you can set the «return» parameter to '#'. If return contains only '#' (i.e., it isn't an array with other 'P', 'L' or 'S' elements), it will determine the number of subPatterns in the regular expression without actually executing a matching search. Thus, if you wanted to pass an index to «subPattern», you can figure out how long to make the index before executing the match.
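The numbering of subpatterns, and counting them before running a match, can be sketched in Python's `re` module. The assumptions here are that the subject text has no spaces after its commas, and that `pattern.groups` plays a role analogous to return: '#'; neither is taken from Analytica itself.

```python
import re

pattern = re.compile(r"([\w_]+)\s*:\s*((\d*,){4})(\d*),")
subject = "NodeInfo: 1,1,1,1,1,1,0,,0,"

# Number of capturing groups, known before any match is attempted
group_count = pattern.groups

match = pattern.search(subject)

# Group 0 is the full match; groups 1..4 follow the order of opening parentheses.
# Group 3 sits inside a repetition, so it holds only the last repetition.
groups = [match.group(i) for i in range(group_count + 1)]

for number, value in enumerate(groups):
    print(number, repr(value))
```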

There can be many groupings, and the number and order of groups may change as you debug your regular expression, so numbered subpatterns are not always the best choice. You can use named subpatterns instead. The syntax for naming a group is (?<name>...), (?'name'...), or (?P<name>...). Once you have named a subpattern, you can extract its value by passing the name as text to the «subPattern» parameter.

 FindInText("([\w_]+)\s*:\s*((\d*,){4})(?<border>\d*),", 
                "NodeInfo: 1,1,1,1,1,1,0,,0,", re: 1, 
                return: 'S', subPattern: 'border') → "1"
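A named subpattern behaves the same way in Python's `re` module, sketched below as a stand-in for Analytica. The sketch uses the (?P<name>...) form, which is the spelling Python accepts, and a subject text without spaces after the commas.

```python
import re

pattern = re.compile(r"([\w_]+)\s*:\s*((\d*,){4})(?P<border>\d*),")
match = pattern.search("NodeInfo: 1,1,1,1,1,1,0,,0,")

# Look the group up by name instead of by number
border = match.group("border")

# The name is still tied to a group number, here the 4th group
border_number = pattern.groupindex["border"]

print(border, border_number)  # 1 4
```

Referring to the group by name keeps the lookup valid even if more parentheses are later added earlier in the pattern.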

Duplicate Subpatterns

Cases frequently arise in which a subpattern has two or more alternative syntaxes, so that two subpatterns within the regular expression need the same name; usually these are alternatives in a disjunction. For example, in a standard Excel-compatible CSV format, a cell with no comma or new-line characters does not need to be quoted, but if the cell contains a comma, quotes must be placed around it. A line of a CSV file might be:

San Jose, "1,006,102", 10, Chuck Reed, "SJ, San José, SJC"

This CSV entry has 5 items separated by commas, but two items have internal commas and thus are quoted. Thus, each item matches one of two possible regular expressions, either ([^,]+) or "(.*?)". Notice that the parentheses in the second case do not include the quotes, since we do not wish to include them in the extracted text. To match either, we form a disjunction, and since both branches refer to the same item, we give both the same subpattern name. Each disjunction is enclosed in a non-capturing group (?:...) so that the alternation applies only to that item:

(?:"(?<city>.+?)"|(?<city>[^,]+?)),\s*(?:"(?<pop>.+?)"|(?<pop>[^,]+))

Because the two subpatterns named city are disjunctive, only one of them will match. So, when you request the subpattern "city", you'll get the one which matched (the second in the example). Similarly, only one "pop" subpattern will match, in this example the first, so you'll get info for the one that actually matched.
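Python's `re` module does not allow two groups with the same name, so the following sketch substitutes a different technique: each item gets two distinctly named groups (the hypothetical suffixes `_q` for quoted and `_u` for unquoted), and a small helper returns whichever one matched. It illustrates the idea of disjunctive branches, not Analytica's duplicate-name feature.

```python
import re

def field(name):
    # Quoted branch captures the text between the quotes;
    # unquoted branch captures everything up to the next comma.
    return r'(?:"(?P<%s_q>[^"]*)"|(?P<%s_u>[^,]*))' % (name, name)

pattern = re.compile(field("city") + r",\s*" + field("pop"))

def pick(match, name):
    # Only one branch of each disjunction matches; the other group is None.
    quoted = match.group(name + "_q")
    return quoted if quoted is not None else match.group(name + "_u")

line = 'San Jose, "1,006,102", 10, Chuck Reed'
match = pattern.match(line)

city = pick(match, "city")
pop = pick(match, "pop")
print(city, "|", pop)  # San Jose | 1,006,102
```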

You could have multiple matches to a subpattern (either named or numbered), as occurs with the regular expression "b(a)*c" applied to "dbaaacd". There is a limitation here in that you can only get the data for one of the repeated matches, the last one.

Splitting on a Pattern

You can provide a regular expression as the separator to the SplitText function. This makes it possible to split text into parts in such a way that allows multiple types of separators, variable length separators, or uncertainty about what the separator will be.

For example, to split on any punctuation character:

SplitText(text, "[\.\?,!]", re: 1)

Or to split on any run of whitespace, so that consecutive spaces don't produce empty items:

SplitText(text, "\s+", re: 1)

Notice that the parameter re: 1 must be specified to cause the separator to be interpreted as a regular expression.
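The two separators above behave as follows in Python's `re.split`, used here as a rough analogue of SplitText with re: 1. The sample sentences are invented, and details such as the trailing empty item are Python's behavior, which may or may not match Analytica's.

```python
import re

# Split on any one punctuation character
by_punctuation = re.split(r"[.?,!]", "Stop! Who goes there? Friend, I hope.")

# Split on runs of whitespace, so repeated spaces give no empty items
by_whitespace = re.split(r"\s+", "one  two   three")

print(by_punctuation)
print(by_whitespace)
```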

Substitutions

The TextReplace function accepts a regular expression as its pattern when the re: 1 parameter is specified.

TextReplace(text, pattern, subst, all, caseInsensitive, re)

Parameters:

«text»
the subject text being matched to
«pattern»
The regular expression
«subst»
the text to be substituted for the subtext that matches pattern
«all»
0 = replace only the first occurrence (default)
1 = replace every occurrence
«caseInsensitive»
1 = 'A' matches 'a', etc. Matching is case-sensitive by default.
«re»
Must be set to 1 for regular expressions

It is recommended that you use a named-parameter calling syntax for the optional parameters. Here are some examples:

TextReplace("3.141592654", "1|5|9", "0", re: 1) → "3.041592654"
TextReplace("3.141592654", "1|5|9", "0", re: 1, all: 1) → "3.040002604"
TextReplace("3.141592654", "(1|5|9)+", "0", re: 1, all: 1) → "3.0402604"
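The same three substitutions can be checked with Python's `re.sub`, used as a stand-in for TextReplace. One assumption in this mapping: `count=1` plays the role of the default (first occurrence only), and omitting `count` corresponds to all: 1.

```python
import re

number = "3.141592654"

# Replace only the first 1, 5, or 9
first_only = re.sub(r"1|5|9", "0", number, count=1)

# Replace every 1, 5, or 9
every_one = re.sub(r"1|5|9", "0", number)

# Replace each run of 1s, 5s, and 9s with a single 0
each_run = re.sub(r"(1|5|9)+", "0", number)

print(first_only, every_one, each_run)
```

In the last case the run "159" collapses to one "0", which is why the result is shorter than the input.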

SubPattern Substitutions

When regular expressions are used, the «subst» parameter may refer to subPattern groupings that appear in the «pattern» parameter. The matching text for those is substituted accordingly. \0 denotes the full text matched by the full regular expression, \1 is the first subpattern, \2 the second, up to \9.

You can also refer to named subpatterns using <name> in the «subst» parameter. Again, the subtext matching the corresponding named subpattern is substituted. Some examples:

TextReplace("3.141592", "(\d)", "\1\1", re: 1) → "33.141592"
TextReplace("3.141592", "(\d)", "\1\1", re: 1, all: 1) → "33.114411559922"
TextReplace("time", "(.)(.)(.)(.)", "\4\3\2\1", re: 1, all: 1) → "emit"
TextReplace("543,632","(?<x>\d+),(?<y>\d+)", "<y>,<x>", re: 1, all: 1) → "632,543"
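These subpattern substitutions have close equivalents in Python's `re.sub`, sketched below. The differences are notational and specific to Python: named groups are written (?P<name>...) in the pattern and \g<name> in the replacement, whereas the examples above use (?<name>...) and <name>.

```python
import re

# \1 refers to the first subpattern; count=1 replaces the first match only
doubled_first = re.sub(r"(\d)", r"\1\1", "3.141592", count=1)
doubled_all = re.sub(r"(\d)", r"\1\1", "3.141592")

# Reorder four single-character subpatterns
reversed_word = re.sub(r"(.)(.)(.)(.)", r"\4\3\2\1", "time")

# Swap two named subpatterns
swapped = re.sub(r"(?P<x>\d+),(?P<y>\d+)", r"\g<y>,\g<x>", "543,632")

print(doubled_first, doubled_all, reversed_word, swapped)
```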

Credits

Analytica makes use of the Perl Compatible Regular Expression library, written by Philip Hazel (email: ph10 at cam.ac.uk) of the University of Cambridge Computing Service, Cambridge, England. Copyright (c) 1997-2008 University of Cambridge. All rights reserved.

The library is included in Analytica under the "BSD" license published with the PCRE release 7.8 distributable.

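See Also

PCRE Pattern Man Page (http://www.newlisp.org/downloads/pcrepattern.html) -- for in-depth info on the full set of patterns available.
FindInText
SplitText
TextReplace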
