Regular expression syntax
About regular expressions
The Arctera Insight Classification supports a regular expression syntax that is based on the syntax in the Perl programming language. In Perl regular expressions, all characters match themselves except for the following special characters:
.[]{}()\*+?|^$
For more information on the Perl syntax, see the following webpage:
You may find it helpful to build and test your regular expressions using the free online tool at https://regex101.com. This tool displays an explanation of your regular expression as you type it, and also lists all matches between the regular expression and a test string of your choice. The default regular expression flavor, pcre (php), is compatible with the Arctera Insight Classification.
Note: Looking for regular expression matches is considerably slower than looking for matches for specific words or phrases. You can greatly improve performance and accuracy by looking for instances where both types of matches occur in proximity to each other. To do this, set up an All of condition group that contains both a regular expression condition and a contains text condition for finding specific words and phrases, and specify the required distance within which matches must occur. The Arctera Insight Classification first evaluates the contains text condition and only then looks for a regular expression match.
Wildcards
The ``.`` (period) character matches any single character when it is used outside of a character set.
Anchors
The ``^`` (caret) character matches the start of a line. The ``$`` (dollar) character matches the end of a line.
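The following is a minimal sketch of the wildcard and the anchors. It uses Python's ``re`` module as a stand-in engine, which is an assumption made for illustration only: it is not the engine that the Arctera Insight Classification uses, and ``re.MULTILINE`` is Python's way of making ``^`` and ``$`` match at line boundaries.

```python
import re

# "." matches any single character (in Python, any character except a newline by default).
wildcard_hits = re.findall(r"c.t", "cat cot c-t ct")

# With re.MULTILINE, "^" and "$" anchor to the start and end of each line.
text = "error: disk full\nwarning: low memory\nerror: timeout"
error_lines = re.findall(r"^error: .*$", text, flags=re.MULTILINE)

print(wildcard_hits)  # three-character hits only; "ct" has nothing for "." to match
print(error_lines)    # only the lines that begin with "error: "
```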
Marked subexpressions
A section that is surrounded with the characters ``(`` and ``)`` acts as a marked subexpression. The matching algorithm captures whatever matches the subexpression. Marked subexpressions can be repeated, or a back-reference can refer to them.
Non-marking groupings
A marked subexpression is useful for lexically grouping part of a regular expression, but it has the side effect of additional overhead. As an alternative, you can lexically group part of a regular expression without generating a marked subexpression by using ``(?:`` and ``)``. For example, ``(?:ab)+`` repeats ``ab`` without splitting out any separate subexpressions.
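To make the difference between the two kinds of grouping concrete, here is a small sketch. It assumes Python's ``re`` module as a stand-in engine, and the pattern and input string are made up for illustration.

```python
import re

# Marked subexpression: the parentheses capture the repeated part.
marked = re.fullmatch(r"(ab)+c", "ababc")

# Non-marking grouping: the same overall match, but nothing is captured.
unmarked = re.fullmatch(r"(?:ab)+c", "ababc")

print(marked.group(0), marked.groups())      # whole match plus one captured group
print(unmarked.group(0), unmarked.groups())  # whole match, no captured groups
```

Both patterns accept the same strings. The only difference is whether the engine records what the group matched.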
Repeats
You can repeat any atom (single character, marked or non-marked subexpression, or character class) with the operators ``*``, ``+``, ``?``, and ``{}``.
Table: Repeat operators
| Operator | Description |
|---|---|
| ``*`` | Matches the preceding atom zero or more times. For example, ``a*b`` matches any of the following: ``b``, ``ab``, ``aaaaaaaab``. |
| ``+`` | Matches the preceding atom one or more times. For example, ``a+b`` matches either of the following: ``ab``, ``aaaaaaaab``. However, it does not match ``b``. |
| ``?`` | Matches the preceding atom zero or one times. For example, ``ca?b`` matches either of the following: ``cb``, ``cab``. However, it does not match ``caab``. |
| ``{}`` | Repeats the preceding atom with a bounded repeat. ``a{n}`` matches ``a`` repeated exactly n times. ``a{n,}`` matches ``a`` repeated n or more times. ``a{n,m}`` matches ``a`` repeated between n and m times inclusive. For example, ``^a{2,3}$`` matches either of the following: ``aa``, ``aaa``. However, it does not match ``a`` or ``aaaa``. |
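The examples in the table can be checked with a short script. This is a sketch only: it assumes Python's ``re`` module as a stand-in engine, and ``matches`` is a hypothetical helper that is not part of any product API. Because ``re.fullmatch`` already requires the whole string to match, the ``^`` and ``$`` anchors from the bounded-repeat example are left out.

```python
import re

def matches(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

star = matches(r"a*b", ["b", "ab", "aaaaaaaab"])      # zero or more
plus = matches(r"a+b", ["b", "ab", "aaaaaaaab"])      # one or more
optional = matches(r"ca?b", ["cb", "cab", "caab"])    # zero or one
bounded = matches(r"a{2,3}", ["a", "aa", "aaa", "aaaa"])  # between 2 and 3

print(star, plus, optional, bounded)
```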
These operators are "greedy"; they consume as much input as possible. However, non-greedy versions are available that consume as little input as possible while still producing a match. To make a repeat non-greedy, follow the repeat operator (``*``, ``+``, ``?``, or ``{}``) with the ``?`` character.
By default, when a repeated pattern does not match, the Arctera Insight Classification backtracks until it finds a match. This behavior can sometimes be undesirable for matching or performance reasons, so there are also "possessive" repeats. These match as much as possible and do not then allow backtracking if the rest of the expression fails to match.
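The difference between greedy and non-greedy repeats is easiest to see side by side. The sketch below assumes Python's ``re`` module as a stand-in engine and uses a made-up input string. Possessive repeats are not shown because support for them depends on the Python version.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: ".+" runs to the last ">" in the input.
greedy = re.findall(r"<.+>", html)

# Non-greedy: ".+?" stops at the first ">" that allows a match.
lazy = re.findall(r"<.+?>", html)

print(greedy)  # one long match
print(lazy)    # one match per tag
```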
Back references
An escape character that is followed by a digit n, where n is in the range 1 through 9, matches the same string that the subexpression n matched. For example, consider the following expression:
^(a{2,3}).*\1$
This matches ``aaabbaaa``, but it does not match ``aaabba``.
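The back-reference example above can be reproduced as follows. The sketch assumes Python's ``re`` module as a stand-in engine; the expression itself is the one from the text.

```python
import re

pattern = re.compile(r"^(a{2,3}).*\1$")

ok = pattern.match("aaabbaaa")   # the text captured by group 1 appears again at the end
bad = pattern.match("aaabba")    # neither "aaa" nor "aa" appears again at the end

print(ok.group(1) if ok else None)
print(bad)
```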
Alternation
The ``|`` operator matches either of its arguments. For example, ``abc|def`` matches either ``abc`` or ``def``.
You can use parentheses to group alternations. For example, ``ab(?:d|ef)`` matches either ``abd`` or ``abef``.
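As a quick check of both alternation examples, here is a sketch that assumes Python's ``re`` module as a stand-in engine. The candidate strings are made up for illustration.

```python
import re

# Top-level alternation: the whole of "abc" or the whole of "def".
plain = [s for s in ["abc", "def", "abf"] if re.fullmatch(r"abc|def", s)]

# Grouped alternation: "ab" followed by either "d" or "ef".
grouped = [s for s in ["abd", "abef", "abdef", "ef"] if re.fullmatch(r"ab(?:d|ef)", s)]

print(plain)
print(grouped)
```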
Character sets
A character set is a bracket expression that is enclosed within the characters ``[`` and ``]``. It defines a set of characters, and matches any single character that is a member of the set.
A bracket expression can contain any combination of the following:
- Single characters. For example, ``[abc]`` matches any of the characters ``a``, ``b``, or ``c``.
- Character ranges. For example, ``[a-c]`` matches any single character in the range ``a`` through ``c``. By default, for Perl regular expressions, a character x is within the range y to z if its code point lies within the code points of the endpoints of the range.
- Negation. If the bracket expression begins with the ``^`` character, it matches the complement of the characters that it contains. For example, ``[^a-c]`` matches any character that is not in the range ``a`` through ``c``.
- Character classes. An expression of the form ``[[:name:]]`` matches the named character class ``name``. For example, ``[[:lower:]]`` matches any lowercase character. The supported character classes are as follows:

  | Name | Description | Name | Description |
  |---|---|---|---|
  | alnum | Any alphanumeric character. | punct | Any punctuation character. |
  | alpha | Any alphabetic character. | s | Any whitespace character. |
  | blank | Any whitespace character that is not a line separator. | space | Any whitespace character. |
  | cntrl | Any control character. | unicode | Any extended character whose code point is above 255 in value. |
  | d | Any decimal digit. | u | Any uppercase character. |
  | digit | Any decimal digit. | upper | Any uppercase character. |
  | graph | Any graphical character. | w | Any word character (alphanumeric characters plus the underscore). |
  | l | Any lowercase character. | word | Any word character (alphanumeric characters plus the underscore). |
  | lower | Any lowercase character. | xdigit | Any hexadecimal digit character. |
  | print | Any printable character. | | |

- Escaped characters. All the escape sequences that match a single character or character class are permitted within a character class definition. For example, ``[\[\]]`` matches either ``[`` or ``]``, whereas ``[\W\d]`` matches any character that is either a digit or not a word character.
- Combinations. You can combine one or more of the above in a character set declaration. For example, ``[a-cmnx-y\d]``.
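The bracket expression forms listed above can be exercised with a short sketch. It assumes Python's ``re`` module as a stand-in engine. Python does not support the ``[[:name:]]`` named classes, so that form is not shown, and the sample strings are made up for illustration.

```python
import re

sample = "a1_B-[x]"

letters_a_to_c = re.findall(r"[a-c]", "abcdef")          # character range
not_a_to_c = re.findall(r"[^a-c]", "abcdef")             # negation
brackets = re.findall(r"[\[\]]", sample)                 # escaped [ and ]
digit_or_nonword = re.findall(r"[\W\d]", sample)         # a digit, or not a word character
combined = re.findall(r"[a-cmnx-y\d]", "abcmnxyz019 q")  # combination of forms

print(letters_a_to_c, not_a_to_c, brackets, digit_or_nonword, combined)
```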
Escapes
Any special character that is preceded by an escape matches itself.
Table: Escape sequences that are synonyms for single characters
| Escape | Character |
|---|---|
| \a | \a |
| \e | 0x1B |
| \f | \f |
| \n | \n |
| \r | \r |
| \t | \t |
| \v | \v |
| \b | \b (but only inside a character class declaration). |
| \cX | An ASCII escape sequence: the character whose code point is X % 32. |
| \xXX | A hexadecimal escape sequence: matches the single character whose code point is 0xXX. |
| \x{XXXX} | A hexadecimal escape sequence: matches the single character whose code point is 0xXXXX. |
| \0ddd | An octal escape sequence: matches the single character whose code point is 0ddd. |
| \N{name} | Matches the single character that has the symbolic name name. For example, \N{newline} matches the single character \n. |
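The ``X % 32`` rule for the ASCII escape sequence can be worked through numerically. In the sketch below, ``control_code_point`` is a hypothetical helper that only reproduces the arithmetic from the table. The final line assumes Python's ``re`` module as a stand-in engine to show a two-digit hexadecimal escape.

```python
import re

def control_code_point(letter):
    """Code point selected by the \\cX escape, using the table's rule: X % 32."""
    return ord(letter) % 32

cr = control_code_point("M")    # 77 % 32
lf = control_code_point("J")    # 74 % 32
esc = control_code_point("[")   # 91 % 32

# A two-digit hexadecimal escape: \x41 is the letter A.
hex_match = re.fullmatch(r"\x41\t", "A\t")

print(cr, lf, esc, bool(hex_match))
```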
"Single character" character classes
When x is the name of a character class, the escape sequence ``\x`` matches any character that is a member of the class. Conversely, ``\X`` matches any character that is not a member of the x class.
Table: Escape sequences for "single character" character classes
| Escape | Equivalent to | Escape | Equivalent to |
|---|---|---|---|
| \d | [[:digit:]] | \D | [^[:digit:]] |
| \l | [[:lower:]] | \L | [^[:lower:]] |
| \s | [[:space:]] | \S | [^[:space:]] |
| \u | [[:upper:]] | \U | [^[:upper:]] |
| \w | [[:word:]] | \W | [^[:word:]] |
| \h | Horizontal whitespace | \H | Not horizontal whitespace |
| \v | Vertical whitespace | \V | Not vertical whitespace |
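The following sketch shows a few of these class escapes and their negated forms in use. It assumes Python's ``re`` module as a stand-in engine. Python supports ``\d``, ``\s``, and ``\w`` and their uppercase counterparts, but not ``\l``, ``\u``, or ``\h`` as class escapes, so only the first three are shown.

```python
import re

text = "Order 66, aisle 7"

digits = re.findall(r"\d+", text)       # runs of digits
non_digits = re.findall(r"\D+", text)   # runs of anything that is not a digit
words = re.findall(r"\w+", text)        # runs of word characters
gaps = re.findall(r"\s+", text)         # runs of whitespace

print(digits, non_digits, words, gaps)
```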
Word boundaries
The following escape sequences match boundaries of words.
Table: Escape sequences for word boundaries
| Escape | Description |
|---|---|
| \< | Matches the start of a word. |
| \> | Matches the end of a word. |
| \b | Matches a word boundary (the start or end of a word). |
| \B | Matches only when not at a word boundary. |
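A short sketch of ``\b`` and ``\B`` follows. It assumes Python's ``re`` module as a stand-in engine, which does not support ``\<`` or ``\>``, so those two escapes are not shown. The input string is made up for illustration.

```python
import re

text = "cat concatenate catalog bobcat"

# "cat" as a complete word.
whole_word = re.findall(r"\bcat\b", text)

# Words that start with "cat".
word_start = re.findall(r"\bcat\w*", text)

# Start offsets of "cat" strictly inside a word (no boundary on either side).
inside = [m.start() for m in re.finditer(r"\Bcat\B", text)]

print(whole_word, word_start, inside)
```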
Line endings
The following escape sequences match line endings.
Table: Escape sequences for line endings
| Escape | Description |
|---|---|
| \n | Newline. |
| \r | CR. |
| \R | Any line-ending character sequence. This is identical to the following expression: (?>\x0D\x0A?\|[\x0A-\x0C\x85\x{2028}\x{2029}]) |
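The expanded form of ``\R`` can be approximated in an engine that lacks it. The sketch below assumes Python's ``re`` module as a stand-in engine. ``ANY_LINE_ENDING`` is a hypothetical name, the independent subexpression wrapper ``(?>...)`` is deliberately omitted, and ``\x{2028}`` is written in Python's ``\u2028`` spelling, so this is an approximation of the expression in the table rather than an exact copy.

```python
import re

# CR optionally followed by LF, or any single character from LF through FF,
# or NEL (0x85), or the Unicode line and paragraph separators.
ANY_LINE_ENDING = re.compile(r"\r\n?|[\n-\x0c\x85\u2028\u2029]")

text = "one\r\ntwo\nthree\rfour\u2028five"
lines = ANY_LINE_ENDING.split(text)

print(lines)
```

Because the CR alternative is tried first and takes an LF when one follows, a CRLF pair counts as one line ending rather than two.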
Other escapes
Except for the following characters, any escape sequence matches the character that is escaped:
```txt
' ` A C E G K Q X z Z
```
For example, ``\@`` matches a literal ``@``.
Perl-specific extensions
All Perl-specific extensions to the regular expression syntax start with ``(?``.
Named subexpressions
You can create a named subexpression as follows:
```txt
(?<NAME>expression)
```
You can then refer to the subexpression by the name ``NAME``. Alternatively, you can delimit the name with single quotes, as in the following:
```txt
(?'NAME'expression)
```
You can then refer to the subexpression in a backreference using either ``\g{NAME}`` or ``\k<NAME>``.
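The idea of naming a subexpression and referring back to it can be sketched as follows. Note the assumption loudly: the sketch uses Python's ``re`` module as a stand-in engine, and Python spells these constructs differently, as ``(?P<NAME>expression)`` and ``(?P=NAME)``, rather than with the syntax described above. Only the concept carries over.

```python
import re

# Python spelling: (?P<word>...) defines the name, (?P=word) refers back to it.
doubled = re.compile(r"\b(?P<word>\w+) (?P=word)\b")

m = doubled.search("it was was a typo")
repeated = m.group("word") if m else None

print(repeated)  # the word that appears twice in a row
```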
Comments
``(?# ... )`` is treated as a comment. Its contents are ignored.
Modifiers
``(?imsx-imsx ... )`` alters which of the Perl modifiers are in effect within the pattern. Changes take effect from the point that the block is first seen and extend to any enclosing ``)``. Letters before a ``-`` turn the corresponding Perl modifiers on, and those after the ``-`` turn them off. ``(?imsx-imsx:pattern)`` applies the specified modifiers to ``pattern`` only.
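As an illustration of the scoped form, the sketch below applies case-insensitivity to one part of a pattern only. It assumes Python's ``re`` module as a stand-in engine, and the pattern and inputs are made up.

```python
import re

# (?i:...) makes only the enclosed part case-insensitive.
scoped = re.compile(r"(?i:confidential) ID-\d+")

hit = scoped.search("CONFIDENTIAL ID-42")    # the first word may be in any case
miss = scoped.search("confidential id-42")   # "ID-" is still case-sensitive

print(bool(hit), bool(miss))
```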
Non-marking groups
``(?:pattern)`` lexically groups ``pattern``, without generating an additional subexpression.
Lookahead
``(?=pattern)`` consumes zero characters, but only if ``pattern`` matches. ``(?!pattern)`` consumes zero characters, but only if ``pattern`` does not match.
You typically use lookahead to create the logical AND of two regular expressions. For example, if a password must contain a lowercase letter, an uppercase letter, and a punctuation symbol, and it must be at least six characters long, then you can use the following expression to validate the password:
```txt
(?=.*[[:lower:]])(?=.*[[:upper:]])(?=.*[[:punct:]]).{6,}
```
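The password rule can be tried out with a translated pattern. The sketch assumes Python's ``re`` module as a stand-in engine. Python has no ``[[:lower:]]``-style classes, so ASCII ranges and ``string.punctuation`` are substituted, which is an approximation of the named classes. ``is_valid`` is a hypothetical helper.

```python
import re
import string

# Escape the punctuation characters so they are safe inside a bracket expression.
PUNCT = re.escape(string.punctuation)

# Three lookaheads act as a logical AND; ".{6,}" enforces the minimum length.
PASSWORD = re.compile(rf"(?=.*[a-z])(?=.*[A-Z])(?=.*[{PUNCT}]).{{6,}}")

def is_valid(candidate):
    """True if the candidate satisfies every requirement at once."""
    return PASSWORD.fullmatch(candidate) is not None

print(is_valid("Abc!ef"), is_valid("abc!ef"), is_valid("Ab!c"), is_valid("Abcdef"))
```

Each lookahead inspects the whole candidate without consuming it, which is why the order of the three requirements does not matter.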
Lookbehind
``(?<=pattern)`` consumes zero characters, but only if ``pattern`` can be matched against the characters that precede the current position (``pattern`` must be of fixed length). ``(?<!pattern)`` consumes zero characters, but only if ``pattern`` cannot be matched against the characters that precede the current position (``pattern`` must be of fixed length).
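A sketch of positive and negative lookbehind follows. It assumes Python's ``re`` module as a stand-in engine, and the input string is made up for illustration.

```python
import re

text = "cost: $40, weight: 12 kg, refund: $5"

# Digits that are immediately preceded by "$".
dollar_amounts = re.findall(r"(?<=\$)\d+", text)

# Digits that are not preceded by "$" or by another digit.
other_numbers = re.findall(r"(?<![\$\d])\d+", text)

print(dollar_amounts, other_numbers)
```

The lookbehind patterns here are one character long, which satisfies the fixed-length requirement.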
Independent subexpressions
``(?>pattern)`` matches ``pattern`` independently of the surrounding patterns. The expression never backtracks into ``pattern``.
Conditional expressions
``(?(condition)yes-pattern|no-pattern)`` tries to match ``yes-pattern`` if the condition is true, and otherwise tries to match ``no-pattern``. ``(?(condition)yes-pattern)`` tries to match ``yes-pattern`` if the condition is true, and otherwise matches the null (empty) string. ``condition`` may be one of the following:
- A forward lookahead assert.
- The index of a marked subexpression (the condition becomes true if the subexpression has been matched).
Here is a summary of the possible predicates:
| Predicate | Description |
| --- | --- |
| ``(?(?=assert)yes-pattern\|no-pattern)`` | Executes ``yes-pattern`` if the forward look-ahead assert matches, and otherwise executes ``no-pattern``. |
| ``(?(?!assert)yes-pattern\|no-pattern)`` | Executes ``yes-pattern`` if the forward look-ahead assert does not match, and otherwise executes ``no-pattern``. |
| ``(?(N)yes-pattern\|no-pattern)`` | Executes ``yes-pattern`` if subexpression N has been matched, and otherwise executes ``no-pattern``. |
| ``(?(<name>)yes-pattern\|no-pattern)`` | Executes ``yes-pattern`` if named subexpression ``name`` has been matched, and otherwise executes ``no-pattern``. |
| ``(?('name')yes-pattern\|no-pattern)`` | Executes ``yes-pattern`` if named subexpression ``name`` has been matched, and otherwise executes ``no-pattern``. |
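The subexpression-index form of the condition can be sketched as follows. It assumes Python's ``re`` module as a stand-in engine, which supports the ``(?(N)yes-pattern)`` form. The pattern is made up: it accepts a number that is either bare or wrapped in a matching pair of parentheses.

```python
import re

# Group 1 is an optional "(". If it matched, a closing ")" is required;
# if it did not match, the conditional matches the empty string.
balanced = re.compile(r"^(\()?\d+(?(1)\))$")

results = {s: bool(balanced.match(s)) for s in ["123", "(123)", "(123", "123)"]}

print(results)
```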
Operator precedence
The order of precedence for the operators, from highest to lowest, is as follows:
- Escaped characters ``\``
- Character set (bracket expression) ``[]``
- Grouping ``()``
- Single-character-ERE duplication ``* + ? {m,n}``
- Concatenation
- Anchoring ``^$``
- Alternation ``|``
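Two consequences of this ordering are easy to overlook, and the sketch below demonstrates them. It assumes Python's ``re`` module as a stand-in engine, and the test strings are made up for illustration.

```python
import re

# Duplication binds tighter than concatenation: ab* means a(b*), not (ab)*.
star_on_b = bool(re.fullmatch(r"ab*", "abbb"))
star_not_on_ab = bool(re.fullmatch(r"ab*", "abab"))
star_on_group = bool(re.fullmatch(r"(?:ab)*", "abab"))

# Anchoring binds tighter than alternation: ^ab|cd$ means (^ab)|(cd$).
loose = [s for s in ["abzz", "zzcd", "zabz"] if re.search(r"^ab|cd$", s)]
strict = [s for s in ["abzz", "zzcd", "ab", "cd"] if re.search(r"^(?:ab|cd)$", s)]

print(star_on_b, star_not_on_ab, star_on_group)
print(loose, strict)
```

When an anchor is meant to apply to every alternative, wrap the alternatives in a group, as the ``strict`` pattern does.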