Regular expression syntax

Last published : Apr 17, 2026

This topic includes the following:

About regular expressions
Wildcards
Anchors
Marked subexpressions
Non-marking groupings
Repeats
Back references
Alternation
Character sets
Escapes
Perl-specific extensions
Operator precedence

About regular expressions

The Arctera Insight Classification supports a regular expression syntax that is based on the syntax in the Perl programming language. In Perl regular expressions, all characters match themselves except for the following special characters:

.[]{}()\*+?|^$

For more information on the Perl syntax, see the following webpage:

https://perldoc.perl.org/perlre.html

You may find it helpful to build and test your regular expressions using the free online tool at https://regex101.com. This tool displays an explanation of your regular expression as you type it, and also lists all matches between the regular expression and a test string of your choice. The default regular expression flavor, pcre (php), is compatible with the Arctera Insight Classification.

Note: Looking for regular expression matches is considerably slower than looking for matches for specific words or phrases. You can greatly improve performance and accuracy by looking for instances where both types of matches occur in proximity to each other. To do this, set up an All of condition group that contains both a regular expression condition and a contains text condition for finding specific words and phrases, and specify the required distance within which matches must occur. The Arctera Insight Classification first evaluates the contains text condition and only then looks for a regular expression match.

Wildcards

The . (period) character matches any single character, when it is used outside of a character set.

Anchors

The ^ (caret) character matches the start of a line. The $ (dollar) character matches the end of a line.
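As a rough illustration of the wildcard and the anchors, here is a minimal sketch in Python. Python's re module is an assumption used only for demonstration; it is not the classification engine. In Python, the re.MULTILINE flag is needed for ^ and $ to match at line boundaries rather than only at the start and end of the whole string.

```python
import re

text = "cat\ncot\ncart"

# "." stands for exactly one character, so "^c.t$" accepts the lines
# "cat" and "cot" but rejects "cart" (two characters between c and t).
anchored = re.findall(r"^c.t$", text, flags=re.MULTILINE)
print(anchored)  # ['cat', 'cot']

# Without anchors, the same pattern can match in the middle of a word.
unanchored = re.findall(r"c.t", "concat cut")
print(unanchored)  # ['cat', 'cut']
```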

Marked subexpressions

A section that is surrounded with the characters ( and ) acts as a marked subexpression. The matching algorithm captures whatever matches the subexpression. Marked subexpressions can be repeated, or a back reference can refer to them.

Non-marking groupings

A marked subexpression is useful for lexically grouping part of a regular expression, but it has the side-effect of additional overhead. As an alternative, you can lexically group part of a regular expression without generating a marked subexpression by using (?: and ). For example, (?:ab)+ repeats ab without splitting out any separate subexpressions.
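The difference between the two kinds of grouping can be seen in a small Python sketch. Python's re module is used here as a stand-in (an assumption), because it accepts the same ( ) and (?: ) syntax.

```python
import re

marked = re.compile(r"(ab)+")      # ( ) creates marked subexpression 1
unmarked = re.compile(r"(?:ab)+")  # (?: ) groups without marking

m1 = marked.fullmatch("ababab")
m2 = unmarked.fullmatch("ababab")

# Both patterns match the same text, but only the first records a group.
print(marked.groups, m1.group(1))    # 1 ab
print(unmarked.groups, m2.groups())  # 0 ()
```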

Repeats

You can repeat any atom (single character, marked or non-marked subexpression, or character class) with the operators *, +, ?, and {}.

Table: Repeat operators

Operator	Description
`*`	Matches the preceding atom zero or more times. For example, `a*b` matches any of the following:
	b
	ab
	aaaaaaaab
`+`	Matches the preceding atom one or more times. For example, `a+b` matches either of the following:
	ab
	aaaaaaaab
	However, it does not match `b`.
`?`	Matches the preceding atom zero or one times. For example, `ca?b` matches either of the following:
	cb
	cab
	However, it does not match `caab`.
`{}`	Repeats the preceding atom with a bounded repeat.
	`a{n}` matches `a` repeated exactly `n` times.
	`a{n,}` matches `a` repeated `n` or more times.
	`a{n,m}` matches `a` repeated between `n` and `m` times inclusive.
	For example, `^a{2,3}$` matches either of the following:
	aa
	aaa
	However, it does not match `a` or `aaaa`.

These operators are "greedy"; they consume as much input as possible. However, non-greedy versions are available that consume as little input as possible while still producing a match. By following the repeat operators *, +, ?, and {} with the ? character, the repeats become non-greedy.
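The following Python sketch contrasts a greedy repeat with its non-greedy counterpart and checks a bounded repeat. It assumes Python's re module as a demonstration engine; the sample strings are made up for illustration.

```python
import re

text = "<b>bold</b>"

# Greedy: ".+" consumes as much as it can while still letting ">" match.
greedy = re.search(r"<.+>", text).group()
# Non-greedy: ".+?" stops at the first position where ">" can match.
lazy = re.search(r"<.+?>", text).group()

print(greedy)  # <b>bold</b>
print(lazy)    # <b>

# Bounded repeat: exactly two or three "a" characters.
bounded = [s for s in ("a", "aa", "aaa", "aaaa") if re.fullmatch(r"a{2,3}", s)]
print(bounded)  # ['aa', 'aaa']
```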

By default, when a repeated pattern does not match, the Arctera Insight Classification backtracks until it finds a match. This behavior can sometimes be undesirable for matching or performance reasons, so there are also "possessive" repeats, which in Perl syntax you write by following the repeat operator with the + character (*+, ++, ?+, and {}+). These match as much as possible and do not then allow backtracking if the rest of the expression fails to match.

Back references

An escape character that is followed by a digit n, where n is in the range 1 through 9, matches the same string that the subexpression n matched. For example, consider the following expression:

^(a{2,3}).*\1$

This matches aaabbaaa, but it does not match aaabba.
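The same expression can be tried in Python, whose re module (assumed here purely for demonstration) uses the same \1 notation for back references.

```python
import re

pattern = re.compile(r"^(a{2,3}).*\1$")

hit = pattern.search("aaabbaaa")
miss = pattern.search("aaabba")

# Subexpression 1 captured "aaa", so the string must also end in "aaa".
print(hit.group(1))  # aaa
print(miss)          # None
```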

Alternation

The | operator matches either of its arguments. For example, abc|def matches either abc or def.

You can use parentheses to group alternations. For example, ab(?:d|ef) matches either abd or abef.
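A short Python sketch of both examples follows. It assumes Python's re module as a stand-in engine and uses fullmatch so that each test string must be matched in its entirety.

```python
import re

ungrouped = re.compile(r"abc|def")
grouped = re.compile(r"ab(?:d|ef)")

# Each list records whether the whole test string is matched.
ungrouped_results = [bool(ungrouped.fullmatch(s)) for s in ("abc", "def", "abcdef")]
grouped_results = [bool(grouped.fullmatch(s)) for s in ("abd", "abef", "abdef")]

print(ungrouped_results)  # [True, True, False]
print(grouped_results)    # [True, True, False]
```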

Character sets

A character set is a bracket expression that is enclosed within the characters [ and ]. It defines a set of characters, and matches any single character that is a member of the set.

A bracket expression can contain any combination of the following:

Single characters. For example, [abc] matches any of the characters a, b, or c.
Character ranges. For example, [a-c] matches any single character in the range a through c. By default, for Perl regular expressions, a character x is within the range y to z, if the code point of the character lies within the code points of the endpoints of the range.
Negation. If the bracket expression begins with the ^ character, it matches the complement of the characters that it contains. For example, [^a-c] matches any character that is not in the range a through c.

Character classes. An expression of the form [[:name:]] matches the named character class name. For example, [[:lower:]] matches any lowercase character. The supported character classes are as follows:

Name	Description	Name	Description
alnum	Any alphanumeric character.	punct	Any punctuation character.
alpha	Any alphabetic character.	s	Any whitespace character.
blank	Any whitespace character that is not a line separator.	space	Any whitespace character.
cntrl	Any control character.	unicode	Any extended character whose code point is above 255 in value.
d	Any decimal digit.	u	Any uppercase character.
digit	Any decimal digit.	upper	Any uppercase character.
graph	Any graphical character.	w	Any word character (alphanumeric characters plus the underscore).
l	Any lowercase character.	word	Any word character (alphanumeric characters plus the underscore).
lower	Any lowercase character.	xdigit	Any hexadecimal digit character.
print	Any printable character.	-	-

Escaped characters. All the escape sequences that match a single character or character class are permitted within a character class definition. For example, [\[\]] matches either [ or ], whereas [\W\d] matches any character that is either a digit or not a word character.
Combinations. You can combine one or more of the above in a character set declaration. For example, [a-cmnx-y\d].
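The bracket expressions above can be exercised with a minimal Python sketch. Python's re module is an assumption used for illustration; it does not support the named classes such as [[:lower:]], so only single characters, ranges, negation, escapes, and combinations are shown.

```python
import re

# Single characters: any one of a, b, or c.
singles = re.findall(r"[abc]", "a1b2z")
# Negated range: any character that is not in a through c.
negated = re.findall(r"[^a-c]", "abcd1")
# Combination of ranges, single characters, and the \d escape.
combined = re.findall(r"[a-cmnx-y\d]", "a m z 7 q x")
# Escaped brackets inside a set match the literal [ and ] characters.
brackets = re.findall(r"[\[\]]", "x[1]")

print(singles)   # ['a', 'b']
print(negated)   # ['d', '1']
print(combined)  # ['a', 'm', '7', 'x']
print(brackets)  # ['[', ']']
```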

Escapes

Any special character that is preceded by an escape matches itself.
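To see why escaping matters, consider this small Python sketch (Python's re module is assumed as a demonstration engine). The re.escape helper is a Python convenience, not a feature of the syntax described here.

```python
import re

text = "1+1=2 and 11=2"

# Unescaped, "+" is a repeat operator: one or more "1", then another "1".
unescaped = re.findall(r"1+1", text)
# Escaped, "\+" matches a literal plus sign.
escaped = re.findall(r"1\+1", text)

print(unescaped)  # ['11']
print(escaped)    # ['1+1']

# re.escape adds the backslashes needed to match a string literally.
print(re.escape("1+1"))  # 1\+1
```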

Table: Escape sequences that are synonyms for single characters

Escape	Character
\a	\a
\e	0x1B
\f	\f
\n	\n
\r	\r
\t	\t
\v	\v
\b	\b (but only inside a character class declaration).
\cX	An ASCII escape sequence: the character whose code point is X % 32.
\xXX	A hexadecimal escape sequence: matches the single character whose code point is 0xXX.
\x{XXXX}	A hexadecimal escape sequence: matches the single character whose code point is 0xXXXX.
\0ddd	An octal escape sequence: matches the single character whose code point is 0ddd.
\N{name}	Matches the single character that has the symbolic name name. For example, \N{newline} matches the single character \n.

"Single character" character classes

When a lowercase letter is the name of a character class, that letter preceded by an escape (for example, \d) matches any character that is a member of the class. Conversely, the escaped uppercase letter (for example, \D) matches any character that is not a member of the class.

Table: Escape sequences for "single character" character classes

Escape	Equivalent to	Escape	Equivalent to
\d	[[:digit:]]	\D	[^[:digit:]]
\l	[[:lower:]]	\L	[^[:lower:]]
\s	[[:space:]]	\S	[^[:space:]]
\u	[[:upper:]]	\U	[^[:upper:]]
\w	[[:word:]]	\W	[^[:word:]]
\h	Horizontal whitespace	\H	Not horizontal whitespace
\v	Vertical whitespace	\V	Not vertical whitespace
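A few of these escapes can be tried in Python, as sketched below. This assumes Python's re module as a stand-in; it supports \d, \s, \w, and their uppercase complements, but not \l, \u, \h, or their complements, so those are omitted.

```python
import re

text = "Order 66, aisle B"

digits = re.findall(r"\d", text)        # each decimal digit
non_digits = re.findall(r"\D+", text)   # runs of anything except digits
words = re.findall(r"\w+", text)        # runs of word characters
fields = re.split(r"\s+", text)         # split on runs of whitespace

print(digits)      # ['6', '6']
print(non_digits)  # ['Order ', ', aisle B']
print(words)       # ['Order', '66', 'aisle', 'B']
print(fields)      # ['Order', '66,', 'aisle', 'B']
```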

Word boundaries

The following escape sequences match boundaries of words.

Table: Escape sequences for word boundaries

Escape	Description
\<	Matches the start of a word.
\>	Matches the end of a word.
\b	Matches a word boundary (the start or end of a word).
\B	Matches only when not at a word boundary.
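The effect of \b and \B can be sketched in Python (an assumed demonstration engine; Python's re module does not support \< and \>, so they are not shown). The sample text is invented for illustration.

```python
import re

text = "cat concat cats category"

# \bcat\b: "cat" as a whole word only.
whole_word = [m.start() for m in re.finditer(r"\bcat\b", text)]
# \Bcat\b: "cat" at the end of a longer word (not at a word start).
word_suffix = [m.start() for m in re.finditer(r"\Bcat\b", text)]
# \bcat\w*: every word that begins with "cat".
cat_words = re.findall(r"\bcat\w*", text)

print(whole_word)   # [0]
print(word_suffix)  # [7]
print(cat_words)    # ['cat', 'cats', 'category']
```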

Line endings

The following escape sequences match line endings.

Table: Escape sequences for line endings

Escape	Description
\n	Newline.
\r	CR.
\R	Any line-ending character sequence. This is identical to the following expression:
	`(?>\x0D\x0A?|[\x0A-\x0C\x85\x{2028}\x{2029}])`

Other escapes

Except for the following characters, any escape sequence matches the character that is escaped:

' ` A C E G K Q X z Z

For example, `\@` matches a literal `@`.
 
Perl-specific extensions

All Perl-specific extensions to the regular expression syntax start with `(?`.
 
Named subexpressions

You can create a named subexpression as follows:

(?<NAME>expression)

You can then refer to the subexpression by the name `NAME`. Alternatively, you can delimit the name with single quotes, as in the following:

(?'NAME'expression)

You can then refer to the subexpression in a back reference using either `\g{NAME}` or `\k<NAME>`.
 
Comments

`(?# ... )` is treated as a comment. Its contents are ignored.

Modifiers

`(?imsx-imsx ... )` alters which of the Perl modifiers are in effect within the pattern. Changes take effect from the point that the block is first seen and extend to any enclosing `)`. Letters before a `-` turn that Perl modifier on, and those after the `-` turn it off. `(?imsx-imsx:pattern)` applies the specified modifiers to `pattern` only.

Non-marking groups

`(?:pattern)` lexically groups `pattern`, without generating an additional subexpression.
 
Lookahead

`(?=pattern)` consumes zero characters, but only if `pattern` matches. `(?!pattern)` consumes zero characters, but only if `pattern` does not match.

You typically use lookahead to create the logical AND of two regular expressions. For example, if a password must contain a lowercase letter, an uppercase letter, and a punctuation symbol, and it must be at least six characters long, then you can use the following expression to validate the password:

(?=.*[[:lower:]])(?=.*[[:upper:]])(?=.*[[:punct:]]).{6,}

Lookbehind

`(?<=pattern)` consumes zero characters, but only if `pattern` can be matched against the characters that precede the current position (`pattern` must be of fixed length). `(?<!pattern)` consumes zero characters, but only if `pattern` cannot be matched against the characters that precede the current position (`pattern` must be of fixed length).
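Lookahead and lookbehind can both be sketched in Python. This is an illustration under stated assumptions, not the classification engine: Python's re module lacks the [[:lower:]], [[:upper:]], and [[:punct:]] classes, so the password check substitutes the ASCII-only sets [a-z], [A-Z], and string.punctuation. The sample passwords and text are invented.

```python
import re
import string

# Lookahead: three zero-width checks, then the length requirement.
# The ASCII sets below are substitutes for the POSIX-style named classes.
punct = re.escape(string.punctuation)
password = re.compile(r"(?=.*[a-z])(?=.*[A-Z])(?=.*[" + punct + r"]).{6,}")

candidates = ("Passw0rd!", "password!", "Pw!", "Password1")
accepted = [p for p in candidates if password.fullmatch(p)]
print(accepted)  # ['Passw0rd!']

# Lookbehind: digits that are, or are not, directly preceded by "$".
text = "cost: $45, qty: 12, refund: $7"
dollars = re.findall(r"(?<=\$)\d+", text)
plain = re.findall(r"(?<!\$)\b\d+", text)
print(dollars)  # ['45', '7']
print(plain)    # ['12']
```

Because each lookahead consumes no characters, all three checks start from the same position, which is what makes them behave as a logical AND.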
 
Independent subexpressions

`(?>pattern)` matches `pattern` independently of the surrounding patterns. The expression never backtracks into `pattern`.
 
Conditional expressions

`(?(condition)yes-pattern|no-pattern)` tries to match `yes-pattern` if the condition is true, and otherwise tries to match `no-pattern`. `(?(condition)yes-pattern)` tries to match `yes-pattern` if the condition is true, and otherwise matches the empty (null) string. `condition` may be one of the following:

- A forward lookahead assert.
- The index or name of a marked subexpression (the condition becomes true if the subexpression has been matched).

Here is a summary of the possible predicates:

Predicate	Description
`(?(?=assert)yes-pattern|no-pattern)`	Executes `yes-pattern` if the forward lookahead assert matches, and otherwise executes `no-pattern`.
`(?(?!assert)yes-pattern|no-pattern)`	Executes `yes-pattern` if the forward lookahead assert does not match, and otherwise executes `no-pattern`.
`(?(N)yes-pattern|no-pattern)`	Executes `yes-pattern` if subexpression N has been matched, and otherwise executes `no-pattern`.
`(?(<name>)yes-pattern|no-pattern)`	Executes `yes-pattern` if named subexpression `name` has been matched, and otherwise executes `no-pattern`.
`(?('name')yes-pattern|no-pattern)`	Executes `yes-pattern` if named subexpression `name` has been matched, and otherwise executes `no-pattern`.
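The subexpression-index form of a conditional can be sketched in Python, whose re module (assumed here as a stand-in) accepts the (?(N)yes-pattern|no-pattern) syntax. The lookahead-assert forms are not supported by Python and are not shown; both patterns below are hypothetical examples.

```python
import re

# Subexpression 1 optionally matches "<". The conditional then requires
# ">" only when subexpression 1 took part in the match.
bracketed = re.compile(r"^(<)?\w+(?(1)>)$")
bracketed_results = [bool(bracketed.search(s)) for s in ("<tag>", "tag", "<tag", "tag>")]
print(bracketed_results)  # [True, True, False, False]

# With a no-pattern: "(12)" needs the closing parenthesis, while a
# number without an opening parenthesis must end in a period.
numbered = re.compile(r"^(\()?\d+(?(1)\)|\.)$")
numbered_results = [bool(numbered.search(s)) for s in ("(12)", "12.", "12", "(12.")]
print(numbered_results)  # [True, True, False, False]
```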
 
Operator precedence

The order of precedence for the operators, from highest to lowest, is as follows:

- Escaped characters `\`
- Character set (bracket expression) `[]`
- Grouping `()`
- Single-character-ERE duplication `* + ? {m,n}`
- Concatenation
- Anchoring `^$`
- Alternation `|`
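Two practical consequences of this ordering are sketched below in Python (an assumed demonstration engine): alternation binds more loosely than anchoring and concatenation, and a repeat binds more tightly than concatenation.

```python
import re

# Alternation has the lowest precedence, so this reads as (^ab) or (cd$).
loose = re.compile(r"^ab|cd$")
# A non-marking group makes both anchors apply to each alternative.
tight = re.compile(r"^(?:ab|cd)$")

print(bool(loose.search("abXYZ")))  # True
print(bool(tight.search("abXYZ")))  # False

# A repeat applies to the preceding atom only: "ab+" is a(b+), not (ab)+.
print(bool(re.fullmatch(r"ab+", "abbb")))      # True
print(bool(re.fullmatch(r"ab+", "abab")))      # False
print(bool(re.fullmatch(r"(?:ab)+", "abab")))  # True
```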
