Regular expression syntax

Last published : Apr 17, 2026
About regular expressions
The Arctera Insight Classification supports a regular expression syntax that is based on the syntax in the Perl programming language. In Perl regular expressions, all characters match themselves except for the following special characters:
.[]{}()\*+?|^$
For more information on the Perl syntax, see the following webpage:
You may find it helpful to build and test your regular expressions using the free online tool at https://regex101.com. This tool displays an explanation of your regular expression as you type it, and also lists all matches between the regular expression and a test string of your choice. The default regular expression flavor, pcre (php), is compatible with the Arctera Insight Classification.
Note: Looking for regular expression matches is considerably slower than looking for matches for specific words or phrases. You can greatly improve performance and accuracy by looking for instances where both types of matches occur in proximity to each other. To do this, set up an All of condition group that contains both a regular expression condition and a contains text condition for finding specific words and phrases, and specify the required distance within which matches must occur. The Arctera Insight Classification first evaluates the contains text condition and only then looks for a regular expression match.
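The ordering described in the note (a cheap text match first, a regular expression only nearby) can be illustrated outside the product. The following Python sketch is not how the Arctera Insight Classification is implemented; the helper name ``find_near_keyword``, the character-based ``distance``, and the sample pattern are all assumptions made for illustration.

```python
import re

def find_near_keyword(text, keyword, pattern, distance):
    """Return regex matches that lie within `distance` characters of `keyword`.

    Hypothetical helper: the plain substring search runs first, and the
    regular expression is applied only to a small window around each hit.
    """
    regex = re.compile(pattern)
    found = set()
    start = text.find(keyword)
    while start != -1:
        lo = max(0, start - distance)
        hi = min(len(text), start + len(keyword) + distance)
        for m in regex.finditer(text, lo, hi):
            found.add((m.start(), m.group()))
        start = text.find(keyword, start + 1)
    # Report matches in document order, without duplicates.
    return [value for _, value in sorted(found)]

sample = "Ref 111-22-3333 appears early. Much later, the SSN 123-45-6789 is listed."
nearby = find_near_keyword(sample, "SSN", r"\d{3}-\d{2}-\d{4}", 20)
```

Only the number close to the keyword is reported; the earlier number is never examined by the regular expression because no keyword occurs near it.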
Wildcards
The . (period) character matches any single character, when it is used outside of a character set.
Anchors
The ^ (caret) character matches the start of a line. The $ (dollar) character matches the end of a line.
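As a rough illustration, the line-anchor behavior can be tried in Python. This is a sketch under an assumption: Python's ``re`` module is a different regular expression flavor, and it treats ``^`` and ``$`` as line anchors only when ``re.MULTILINE`` is set, so that flag is used here to approximate the behavior described above.

```python
import re

text = "alpha\nbeta\ngamma"

# With re.MULTILINE, ^ and $ anchor at the start and end of each line.
line_starts = re.findall(r"^\w", text, flags=re.MULTILINE)
line_ends = re.findall(r"\w$", text, flags=re.MULTILINE)

# A pattern anchored on both sides must cover an entire line.
whole_lines = re.findall(r"^beta$", text, flags=re.MULTILINE)

# Without the flag, Python anchors ^ at the start of the whole string only.
string_start_only = re.findall(r"^\w", text)
```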
Marked subexpressions
A section that is surrounded with the characters ( and ) acts as a marked subexpression. The matching algorithm captures whatever matches the subexpression. Marked subexpressions can be repeated, or a back-reference can refer to them.
Non-marking groupings
A marked subexpression is useful for lexically grouping part of a regular expression, but it has the side-effect of additional overhead. As an alternative, you can lexically group part of a regular expression without generating a marked subexpression by using (?: and ). For example, (?:ab)+ repeats ab without splitting out any separate subexpressions.
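The difference between the two kinds of grouping can be seen in a short Python sketch. Python's ``re`` module is used here only as a convenient stand-in (an assumption, since it is a different flavor); it accepts the same ``( )`` and ``(?: )`` syntax.

```python
import re

marked = re.compile(r"(ab)+c")
unmarked = re.compile(r"(?:ab)+c")

# Both patterns accept exactly the same text.
marked_ok = marked.fullmatch("ababc") is not None
unmarked_ok = unmarked.fullmatch("ababc") is not None

# Only the marked version records a subexpression.
marked_groups = marked.groups
unmarked_groups = unmarked.groups

# A repeated marked subexpression keeps its last repetition.
last_capture = marked.fullmatch("ababc").group(1)
```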
Repeats
You can repeat any atom (single character, marked or non-marked subexpression, or character class) with the operators *, +, ?, and {}.
Table: Repeat operators

| Operator | Description |
| --- | --- |
| ``*`` | Matches the preceding atom zero or more times. For example, ``a*b`` matches any of ``b``, ``ab``, and ``aaaaaaaab``. |
| ``+`` | Matches the preceding atom one or more times. For example, ``a+b`` matches either ``ab`` or ``aaaaaaaab``. However, it does not match ``b``. |
| ``?`` | Matches the preceding atom zero or one times. For example, ``ca?b`` matches either ``cb`` or ``cab``. However, it does not match ``caab``. |
| ``{}`` | Repeats the preceding atom with a bounded repeat. ``a{n}`` matches ``a`` repeated exactly n times. ``a{n,}`` matches ``a`` repeated n or more times. ``a{n,m}`` matches ``a`` repeated between n and m times inclusive. For example, ``^a{2,3}$`` matches either ``aa`` or ``aaa``. However, it does not match ``a`` or ``aaaa``. |

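The examples in the repeat operators table can be checked with a few lines of Python. This is a minimal sketch that assumes Python's ``re`` module as a test bench; the helper ``matches`` is hypothetical and uses ``re.fullmatch`` so that each candidate must match in its entirety.

```python
import re

def matches(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

star = matches(r"a*b", ["b", "ab", "aaaaaaaab", "a"])
plus = matches(r"a+b", ["b", "ab", "aaaaaaaab"])
optional = matches(r"ca?b", ["cb", "cab", "caab"])
bounded = matches(r"^a{2,3}$", ["a", "aa", "aaa", "aaaa"])
```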
These operators are "greedy"; they consume as much input as possible. However, non-greedy versions are available that consume as little input as possible while still producing a match. By following the repeat operators *, +, ?, and {} with the ? character, the repeats become non-greedy.
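The difference between greedy and non-greedy repeats is easiest to see side by side. The sketch below uses Python's ``re`` module as an assumed stand-in; the ``?`` suffix has the same meaning there.

```python
import re

text = "<b>bold</b>"

# Greedy: .+ runs to the last > in the string.
greedy = re.search(r"<.+>", text).group()
# Non-greedy: .+? stops at the first > that allows a match.
lazy = re.search(r"<.+?>", text).group()

# The same applies to bounded repeats.
greedy_digits = re.match(r"\d{2,4}", "123456").group()
lazy_digits = re.match(r"\d{2,4}?", "123456").group()
```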
By default, when a repeated pattern does not match, the Arctera Insight Classification backtracks until it finds a match. This behavior can sometimes be undesirable for matching or performance reasons, so there are also "possessive" repeats, written by following the repeat operator with the + character (for example, a*+ or a{2,3}+). These match as much as possible and do not then allow backtracking if the rest of the expression fails to match.
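The effect of disallowing backtracking can be sketched in Python. Two assumptions apply: in Perl-compatible flavors a possessive repeat is commonly written by appending ``+`` to the repeat operator, and Python's ``re`` module accepts that syntax only from version 3.11 onward, so the possessive pattern is guarded by a version check.

```python
import re
import sys

# Possessive repeats are accepted by Python's re module from 3.11 onward.
POSSESSIVE_SUPPORTED = sys.version_info >= (3, 11)

# Greedy: a* first takes "aaa", then gives one "a" back so that "ab" can match.
greedy = re.fullmatch(r"a*ab", "aaab")

# Possessive: a*+ takes "aaa" and never gives anything back, so "ab" cannot match.
possessive = re.fullmatch(r"a*+ab", "aaab") if POSSESSIVE_SUPPORTED else None
```

On interpreters older than 3.11 the possessive pattern is never compiled, and ``possessive`` is simply ``None``.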
Back references
An escape character that is followed by a digit n, where n is in the range 1 through 9, matches the same string that the subexpression n matched. For example, consider the following expression:
^(a{2,3}).*\1$
This matches aaabbaaa, but it does not match aaabba.
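The back-reference example can be reproduced in Python, assuming its ``re`` module as a stand-in; the ``\1`` syntax is the same.

```python
import re

pattern = re.compile(r"^(a{2,3}).*\1$")

hit = pattern.search("aaabbaaa")    # the string ends with the captured text
miss = pattern.search("aaabba")     # no way to end with a repeat of the capture

captured = hit.group(1)
```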
Alternation
The | operator matches either of its arguments. For example, abc|def matches both abc and def.
You can use parentheses to group alternations. For example, ab(?:d|ef) matches both abd and abef.
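A brief Python sketch (using the ``re`` module as an assumed test bench) shows how grouping changes the scope of an alternation.

```python
import re

candidates = ["abc", "def", "abd", "abef", "ef"]

# abc|def: the whole pattern is one alternative or the other.
either = [s for s in candidates if re.fullmatch(r"abc|def", s)]

# ab(?:d|ef): the alternation applies only to what follows "ab".
grouped = [s for s in candidates if re.fullmatch(r"ab(?:d|ef)", s)]

# Without the group, abd|ef means "abd" or "ef".
ungrouped = [s for s in candidates if re.fullmatch(r"abd|ef", s)]
```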
Character sets
A character set is a bracket expression that is enclosed within the characters [ and ]. It defines a set of characters, and matches any single character that is a member of the set.
A bracket expression can contain any combination of the following:
  • Single characters. For example, [abc] matches any of the characters a, b, or c.
  • Character ranges. For example, [a-c] matches any single character in the range a through c. By default, for Perl regular expressions, a character x is within the range y to z, if the code point of the character lies within the code points of the endpoints of the range.
  • Negation. If the bracket expression begins with the ^ character, it matches the complement of the characters that it contains. For example, [^a-c] matches any character that is not in the range a through c.
  • Character classes. An expression of the form [[:name:]] matches the named character class name. For example, [[:lower:]] matches any lowercase character. The supported character classes are as follows:
    | Name | Description | Name | Description |
    | --- | --- | --- | --- |
    | alnum | Any alphanumeric character. | punct | Any punctuation character. |
    | alpha | Any alphabetic character. | s | Any whitespace character. |
    | blank | Any whitespace character that is not a line separator. | space | Any whitespace character. |
    | cntrl | Any control character. | unicode | Any extended character whose code point is above 255 in value. |
    | d | Any decimal digit. | u | Any uppercase character. |
    | digit | Any decimal digit. | upper | Any uppercase character. |
    | graph | Any graphical character. | w | Any word character (alphanumeric characters plus the underscore). |
    | l | Any lowercase character. | word | Any word character (alphanumeric characters plus the underscore). |
    | lower | Any lowercase character. | xdigit | Any hexadecimal digit character. |
    | print | Any printable character. | | |
  • Escaped characters. All the escape sequences that match a single character or character class are permitted within a character class definition. For example, [\[\]] matches either [ or ], whereas [\W\d] matches any character that is either a digit or not a word character.
  • Combinations. You can combine one or more of the above in a character set declaration. For example, [a-cmnx-y\d].
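The bracket expression forms listed above can be tried in Python. This is a sketch under an assumption: Python's ``re`` module is a different flavor and does not support the ``[[:name:]]`` character classes, so only single characters, ranges, negation, escapes, and combinations are shown.

```python
import re

sample = "abcxyz-019_M"

singles = re.findall(r"[abc]", sample)            # single characters
in_range = re.findall(r"[x-z]", sample)           # character range
negated = re.findall(r"[^a-z]", sample)           # complement of a range
combined = re.findall(r"[a-cmnx-y\d]", sample)    # combination with an escape

# Escaped brackets inside a set match the bracket characters themselves.
brackets = re.findall(r"[\[\]]", "a[b]c")
```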
Escapes
Any special character that is preceded by an escape matches itself.

Table: Escape sequences that are synonyms for single characters

| Escape | Character |
| --- | --- |
| ``\a`` | Alert (bell), 0x07. |
| ``\e`` | Escape, 0x1B. |
| ``\f`` | Form feed, 0x0C. |
| ``\n`` | Newline, 0x0A. |
| ``\r`` | Carriage return, 0x0D. |
| ``\t`` | Tab, 0x09. |
| ``\v`` | Vertical tab, 0x0B. |
| ``\b`` | Backspace, 0x08 (but only inside a character class declaration). |
| ``\cX`` | An ASCII escape sequence: the character whose code point is X % 32. |
| ``\xXX`` | A hexadecimal escape sequence: matches the single character whose code point is 0xXX. |
| ``\x{XXXX}`` | A hexadecimal escape sequence: matches the single character whose code point is 0xXXXX. |
| ``\0ddd`` | An octal escape sequence: matches the single character whose code point is 0ddd. |
| ``\N{name}`` | Matches the single character that has the symbolic name name. For example, ``\N{newline}`` matches the single character ``\n``. |

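The X % 32 rule for ``\cX`` is plain arithmetic on code points and can be worked through in Python. The helper name ``control_char`` is hypothetical. Python's ``re`` module does not accept ``\cX`` itself, so only the arithmetic and the two-digit hexadecimal escape (which Python does accept) are shown.

```python
import re

def control_char(letter):
    """Return the character that \\cX denotes, using the X % 32 rule."""
    return chr(ord(letter) % 32)

carriage_return = control_char("M")   # ord("M") is 77, and 77 % 32 is 13
newline = control_char("J")           # ord("J") is 74, and 74 % 32 is 10
escape = control_char("[")            # ord("[") is 91, and 91 % 32 is 27

# A two-digit hexadecimal escape: 0x41 is the code point of "A".
hex_match = re.fullmatch(r"\x41", "A")
```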
"Single character" character classes
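An escaped lowercase letter that names a character class matches any character that is a member of the class. Conversely, the same escape with an uppercase letter matches any character that is not a member of the class. For example, \d matches any decimal digit, and \D matches any character that is not a decimal digit.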

Table: Escape sequences for "single character" character classes

| Escape | Equivalent to | Escape | Equivalent to |
| --- | --- | --- | --- |
| ``\d`` | ``[[:digit:]]`` | ``\D`` | ``[^[:digit:]]`` |
| ``\l`` | ``[[:lower:]]`` | ``\L`` | ``[^[:lower:]]`` |
| ``\s`` | ``[[:space:]]`` | ``\S`` | ``[^[:space:]]`` |
| ``\u`` | ``[[:upper:]]`` | ``\U`` | ``[^[:upper:]]`` |
| ``\w`` | ``[[:word:]]`` | ``\W`` | ``[^[:word:]]`` |
| ``\h`` | Horizontal whitespace | ``\H`` | Not horizontal whitespace |
| ``\v`` | Vertical whitespace | ``\V`` | Not vertical whitespace |

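A few of these escapes can be tried in Python. This is a sketch under an assumption: Python's ``re`` module supports ``\d``, ``\s``, ``\w`` and their uppercase complements, but not ``\l``, ``\u``, or ``\h``, so only the shared subset is shown.

```python
import re

sample = "Order 66, aisle_7\t(north)"

digits = re.findall(r"\d", sample)      # members of the digit class
spaces = re.findall(r"\s", sample)      # members of the space class
non_word = re.findall(r"\W", sample)    # everything outside the word class

# Every character belongs to either a class or its complement, never both.
partition_ok = all(
    bool(re.fullmatch(r"\w", ch)) != bool(re.fullmatch(r"\W", ch))
    for ch in sample
)
```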
Word boundaries
The following escape sequences match boundaries of words.

Table: Escape sequences for word boundaries

| Escape | Description |
| --- | --- |
| ``\<`` | Matches the start of a word. |
| ``\>`` | Matches the end of a word. |
| ``\b`` | Matches a word boundary (the start or end of a word). |
| ``\B`` | Matches only when not at a word boundary. |

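The boundary escapes can be explored with Python's ``re`` module, used here as an assumed stand-in. It supports ``\b`` and ``\B`` but not ``\<`` or ``\>``, so the start-of-word case is approximated with ``\b`` followed by a lookahead for a word character.

```python
import re

text = "cat concat cats cat."

# \bcat\b: "cat" as a whole word only.
whole_words = [m.start() for m in re.finditer(r"\bcat\b", text)]

# \Bcat: "cat" that does not begin at a word boundary (inside "concat").
inside_word = [m.start() for m in re.finditer(r"\Bcat", text)]

# Approximation of \<cat : a boundary followed by a word character.
word_starts = [m.start() for m in re.finditer(r"\b(?=\w)cat", text)]

# cat\b: "cat" at the end of a word.
word_ends = [m.start() for m in re.finditer(r"cat\b", text)]
```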
Line endings
The following escape sequences match line endings.

Table: Escape sequences for line endings

| Escape | Description |
| --- | --- |
| ``\n`` | Newline. |
| ``\r`` | Carriage return (CR). |
| ``\R`` | Any line-ending character sequence. |

The \R escape is identical to the following expression:
(?>\x0D\x0A?|[\x0A-\x0C\x85\x{2028}\x{2029}])
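Python's ``re`` module has no ``\R`` escape, but the equivalent expression can be approximated there. The sketch below is an assumption-laden translation: it uses a non-capturing group in place of the independent subexpression, and Python's ``\u2028`` notation in place of ``\x{2028}``.

```python
import re

# Approximation of \R: CR optionally followed by LF, or any single
# character from the listed line-ending code points.
LINE_ENDING = r"(?:\r\n?|[\n-\x0c\x85\u2028\u2029])"

text = "one\r\ntwo\nthree\rfour\u2028five"
lines = re.split(LINE_ENDING, text)

# CR LF is consumed as one line ending, not as two.
crlf_count = len(re.findall(LINE_ENDING, "a\r\nb"))
```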
Other escapes
Except for the following characters, any escape sequence matches the character that is escaped:
```txt
' ` A C E G K Q X z Z
```

For example, ``\@`` matches a literal ``@``.

Perl-specific extensions

All Perl-specific extensions to the regular expression syntax start with ``(?``.

Named subexpressions

You can create a named subexpression as follows:

```txt
(?<NAME>expression)
```

You can then refer to the subexpression by the name ``NAME``. Alternatively, you can delimit the name with single quotation marks, as in the following:

```txt
(?'NAME'expression)
```

You can then refer to the subexpression in a backreference using either ``\g{NAME}`` or ``\k<NAME>``.
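
The idea of naming a subexpression and referring back to it can be sketched in Python, with one important caveat: Python's ``re`` module spells the group ``(?P<NAME>expression)`` and the backreference ``(?P=NAME)``, which differs from the syntax described above. The group name ``quote`` is an arbitrary choice for illustration.

```python
import re

# Python spelling: (?P<quote>...) defines the group, (?P=quote) refers back to it.
pattern = re.compile(r"(?P<quote>['\"]).*?(?P=quote)")

match = pattern.search("say 'hello' now")
quoted = match.group()
delimiter = match.group("quote")

# The closing delimiter must be the same character as the opening one.
mismatched = pattern.search("say 'hello\" now")
```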

Comments

``(?# ... )`` is treated as a comment. Its contents are ignored.

Modifiers

``(?imsx-imsx ... )`` alters which of the Perl modifiers are in effect within the pattern. Changes take effect from the point that the block is first seen and extend to any enclosing ``)``. Letters before the ``-`` turn the corresponding Perl modifiers on, and those after the ``-`` turn them off. ``(?imsx-imsx:pattern)`` applies the specified modifiers to ``pattern`` only.
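
Scoped modifiers can be tried in Python, whose ``re`` module (an assumed stand-in, version 3.6 or later) accepts the ``(?flags:pattern)`` and ``(?-flags:pattern)`` forms.

```python
import re

candidates = ["abcdef", "ABCdef", "ABCDEF"]

# (?i:abc) is case-insensitive; the trailing "def" is still case-sensitive.
scoped_on = [s for s in candidates if re.fullmatch(r"(?i:abc)def", s)]

# With case-insensitivity on globally, (?-i:def) turns it off for "def" only.
scoped_off = [
    s for s in candidates
    if re.fullmatch(r"abc(?-i:def)", s, flags=re.IGNORECASE)
]

# (?s:...) lets . match a newline inside the group.
dot_all = re.fullmatch(r"(?s:a.b)", "a\nb") is not None
default_dot = re.fullmatch(r"a.b", "a\nb") is not None
```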

Non-marking groups

``(?:pattern)`` lexically groups ``pattern``, without generating an additional subexpression.

Lookahead

``(?=pattern)`` consumes zero characters, but only if ``pattern`` matches. ``(?!pattern)`` consumes zero characters, but only if ``pattern`` does not match.

You typically use lookahead to create the logical AND of two regular expressions. For example, if a password must contain a lowercase letter, an uppercase letter, and a punctuation symbol, and it must be at least six characters long, then you can use the following expression to validate the password:

```txt
(?=.*[[:lower:]])(?=.*[[:upper:]])(?=.*[[:punct:]]).{6,}
```

Lookbehind

``(?<=pattern)`` consumes zero characters, but only if ``pattern`` can be matched against the characters that precede the current position (``pattern`` must be of fixed length). ``(?<!pattern)`` consumes zero characters, but only if ``pattern`` cannot be matched against the characters that precede the current position (``pattern`` must be of fixed length).
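
The lookahead and lookbehind assertions can be sketched in Python. Python's ``re`` module does not support the ``[[:lower:]]`` style of character class, so the password check below substitutes ``[a-z]``, ``[A-Z]``, and an escaped ``string.punctuation``; these ASCII-only substitutions, and the helper name ``is_valid``, are assumptions made for illustration.

```python
import re
import string

# ASCII stand-ins for [[:lower:]], [[:upper:]] and [[:punct:]].
PUNCT = re.escape(string.punctuation)
PASSWORD = re.compile(rf"(?=.*[a-z])(?=.*[A-Z])(?=.*[{PUNCT}]).{{6,}}")

def is_valid(password):
    """Each lookahead checks one requirement without consuming any input."""
    return PASSWORD.fullmatch(password) is not None

# Lookbehind: digits that directly follow a dollar sign.
amounts = re.findall(r"(?<=\$)\d+", "cost $40, id 17, fee $5")

# Negative lookbehind: whole numbers that do not follow a dollar sign.
plain_numbers = re.findall(r"(?<!\$)\b\d+", "cost $40, id 17, fee $5")
```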

Independent subexpressions

``(?>pattern)`` matches ``pattern`` independently of the surrounding patterns. The expression never backtracks into ``pattern``.

Conditional expressions

``(?(condition)yes-pattern|no-pattern)`` tries to match ``yes-pattern`` if the condition is true, and otherwise tries to match ``no-pattern``. ``(?(condition)yes-pattern)`` tries to match ``yes-pattern`` if the condition is true, and otherwise matches the null (empty) string. ``condition`` may be one of the following:

-  A forward lookahead assert.

-  The index of a marked subexpression (the condition becomes true if the subexpression has been matched).

Here is a summary of the possible predicates:

| Predicate | Description |
| --- | --- |
| ``(?(?=assert)yes-pattern\|no-pattern)`` | Executes ``yes-pattern`` if the forward look-ahead assert matches, and otherwise executes ``no-pattern``. |
| ``(?(?!assert)yes-pattern\|no-pattern)`` | Executes ``yes-pattern`` if the forward look-ahead assert does not match, and otherwise executes ``no-pattern``. |
| ``(?(N)yes-pattern\|no-pattern)`` | Executes ``yes-pattern`` if subexpression N has been matched, and otherwise executes ``no-pattern``. |
| ``(?(<name>)yes-pattern\|no-pattern)`` | Executes ``yes-pattern`` if named subexpression ``name`` has been matched, and otherwise executes ``no-pattern``. |
| ``(?('name')yes-pattern\|no-pattern)`` | Executes ``yes-pattern`` if named subexpression ``name`` has been matched, and otherwise executes ``no-pattern``. |
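
Conditions that test a subexpression can be sketched in Python. Python's ``re`` module (an assumed stand-in) supports conditions on a group index or a group name, written ``(?(1)...)`` or ``(?(name)...)``, but not lookahead conditions, so only the subexpression forms are shown. The group name ``open`` is arbitrary.

```python
import re

# If group 1 (an opening parenthesis) matched, a closing one is required.
by_index = re.compile(r"^(\()?\d+(?(1)\))$")
index_hits = [s for s in ["123", "(123)", "(123", "123)"] if by_index.search(s)]

# Named condition with both branches: "<tag>" must close with ">",
# while a bare "tag" must end with ";".
by_name = re.compile(r"^(?P<open><)?\w+(?(open)>|;)$")
name_hits = [s for s in ["<tag>", "tag;", "tag", "<tag;", "tag>"] if by_name.search(s)]
```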

Operator precedence

The order of precedence for the operators, from highest to lowest, is as follows:

-  Escaped characters ``\``

-  Character set (bracket expression) ``[]``

-  Grouping ``()``

-  Single-character-ERE duplication ``* + ? {m,n}``

-  Concatenation

-  Anchoring ``^$``

-  Alternation ``|``
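
Two consequences of this ordering are worth seeing in practice: repetition applies to the single preceding atom before concatenation happens, and alternation is applied last, after anchoring. The sketch below uses Python's ``re`` module as an assumed stand-in; its precedence for these operators is the same.

```python
import re

candidates = ["abxx", "xxcd", "ab", "xabx"]

# Alternation binds loosest: ^ab|cd$ means "starts with ab" OR "ends with cd".
loose = [s for s in candidates if re.search(r"^ab|cd$", s)]

# Grouping is needed to anchor both alternatives on both sides.
strict = [s for s in candidates if re.search(r"^(?:ab|cd)$", s)]

# Repetition binds tighter than concatenation: ab+ means a(b+), not (ab)+.
repeat_last = re.fullmatch(r"ab+", "abbb") is not None
repeat_pair = re.fullmatch(r"ab+", "abab") is not None
grouped_pair = re.fullmatch(r"(?:ab)+", "abab") is not None
```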