Diffstat (limited to 'doc/html/pcre2syntax.html')
-rw-r--r-- | doc/html/pcre2syntax.html | 174 |
1 files changed, 104 insertions, 70 deletions
diff --git a/doc/html/pcre2syntax.html b/doc/html/pcre2syntax.html index 8364c521..1c0ccb00 100644 --- a/doc/html/pcre2syntax.html +++ b/doc/html/pcre2syntax.html @@ -15,35 +15,36 @@ please consult the man page, in case the conversion went wrong. <ul> <li><a name="TOC1" href="#SEC1">PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY</a> <li><a name="TOC2" href="#SEC2">QUOTING</a> -<li><a name="TOC3" href="#SEC3">ESCAPED CHARACTERS</a> -<li><a name="TOC4" href="#SEC4">CHARACTER TYPES</a> -<li><a name="TOC5" href="#SEC5">GENERAL CATEGORY PROPERTIES FOR \p and \P</a> -<li><a name="TOC6" href="#SEC6">PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P</a> -<li><a name="TOC7" href="#SEC7">BINARY PROPERTIES FOR \p AND \P</a> -<li><a name="TOC8" href="#SEC8">SCRIPT MATCHING WITH \p AND \P</a> -<li><a name="TOC9" href="#SEC9">THE BIDI_CLASS PROPERTY FOR \p AND \P</a> -<li><a name="TOC10" href="#SEC10">CHARACTER CLASSES</a> -<li><a name="TOC11" href="#SEC11">QUANTIFIERS</a> -<li><a name="TOC12" href="#SEC12">ANCHORS AND SIMPLE ASSERTIONS</a> -<li><a name="TOC13" href="#SEC13">REPORTED MATCH POINT SETTING</a> -<li><a name="TOC14" href="#SEC14">ALTERNATION</a> -<li><a name="TOC15" href="#SEC15">CAPTURING</a> -<li><a name="TOC16" href="#SEC16">ATOMIC GROUPS</a> -<li><a name="TOC17" href="#SEC17">COMMENT</a> -<li><a name="TOC18" href="#SEC18">OPTION SETTING</a> -<li><a name="TOC19" href="#SEC19">NEWLINE CONVENTION</a> -<li><a name="TOC20" href="#SEC20">WHAT \R MATCHES</a> -<li><a name="TOC21" href="#SEC21">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a> -<li><a name="TOC22" href="#SEC22">NON-ATOMIC LOOKAROUND ASSERTIONS</a> -<li><a name="TOC23" href="#SEC23">SCRIPT RUNS</a> -<li><a name="TOC24" href="#SEC24">BACKREFERENCES</a> -<li><a name="TOC25" href="#SEC25">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a> -<li><a name="TOC26" href="#SEC26">CONDITIONAL PATTERNS</a> -<li><a name="TOC27" href="#SEC27">BACKTRACKING CONTROL</a> -<li><a name="TOC28" href="#SEC28">CALLOUTS</a> -<li><a name="TOC29" 
href="#SEC29">SEE ALSO</a> -<li><a name="TOC30" href="#SEC30">AUTHOR</a> -<li><a name="TOC31" href="#SEC31">REVISION</a> +<li><a name="TOC3" href="#SEC3">BRACED ITEMS</a> +<li><a name="TOC4" href="#SEC4">ESCAPED CHARACTERS</a> +<li><a name="TOC5" href="#SEC5">CHARACTER TYPES</a> +<li><a name="TOC6" href="#SEC6">GENERAL CATEGORY PROPERTIES FOR \p and \P</a> +<li><a name="TOC7" href="#SEC7">PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P</a> +<li><a name="TOC8" href="#SEC8">BINARY PROPERTIES FOR \p AND \P</a> +<li><a name="TOC9" href="#SEC9">SCRIPT MATCHING WITH \p AND \P</a> +<li><a name="TOC10" href="#SEC10">THE BIDI_CLASS PROPERTY FOR \p AND \P</a> +<li><a name="TOC11" href="#SEC11">CHARACTER CLASSES</a> +<li><a name="TOC12" href="#SEC12">QUANTIFIERS</a> +<li><a name="TOC13" href="#SEC13">ANCHORS AND SIMPLE ASSERTIONS</a> +<li><a name="TOC14" href="#SEC14">REPORTED MATCH POINT SETTING</a> +<li><a name="TOC15" href="#SEC15">ALTERNATION</a> +<li><a name="TOC16" href="#SEC16">CAPTURING</a> +<li><a name="TOC17" href="#SEC17">ATOMIC GROUPS</a> +<li><a name="TOC18" href="#SEC18">COMMENT</a> +<li><a name="TOC19" href="#SEC19">OPTION SETTING</a> +<li><a name="TOC20" href="#SEC20">NEWLINE CONVENTION</a> +<li><a name="TOC21" href="#SEC21">WHAT \R MATCHES</a> +<li><a name="TOC22" href="#SEC22">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a> +<li><a name="TOC23" href="#SEC23">NON-ATOMIC LOOKAROUND ASSERTIONS</a> +<li><a name="TOC24" href="#SEC24">SCRIPT RUNS</a> +<li><a name="TOC25" href="#SEC25">BACKREFERENCES</a> +<li><a name="TOC26" href="#SEC26">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a> +<li><a name="TOC27" href="#SEC27">CONDITIONAL PATTERNS</a> +<li><a name="TOC28" href="#SEC28">BACKTRACKING CONTROL</a> +<li><a name="TOC29" href="#SEC29">CALLOUTS</a> +<li><a name="TOC30" href="#SEC30">SEE ALSO</a> +<li><a name="TOC31" href="#SEC31">AUTHOR</a> +<li><a name="TOC32" href="#SEC32">REVISION</a> </ul> <br><a name="SEC1" href="#TOC1">PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY</a><br> 
<P> @@ -57,15 +58,27 @@ documentation. This document contains a quick-reference summary of the syntax. <pre> \x where x is non-alphanumeric is a literal x \Q...\E treat enclosed characters as literal -</PRE> +</pre> +Note that white space inside \Q...\E is always treated as literal, even if +PCRE2_EXTENDED is set, causing most other white space to be ignored. +</P> +<br><a name="SEC3" href="#TOC1">BRACED ITEMS</a><br> +<P> +With one exception, wherever brace characters { and } are required to enclose +data for constructions such as \g{2} or \k{name}, space and/or horizontal tab +characters that follow { or precede } are allowed and are ignored. In the case +of quantifiers, they may also appear before or after the comma. The exception +is \u{...} which is not Perl-compatible and is recognized only when +PCRE2_EXTRA_ALT_BSUX is set. This is an ECMAScript compatibility feature, and +follows ECMAScript's behaviour. </P> -<br><a name="SEC3" href="#TOC1">ESCAPED CHARACTERS</a><br> +<br><a name="SEC4" href="#TOC1">ESCAPED CHARACTERS</a><br> <P> This table applies to ASCII and Unicode environments. An unrecognized escape sequence causes an error. <pre> \a alarm, that is, the BEL character (hex 07) - \cx "control-x", where x is any ASCII printing character + \cx "control-x", where x is a non-control ASCII character \e escape (hex 1B) \f form feed (hex 0C) \n newline (hex 0A) @@ -103,7 +116,7 @@ also given. \N{U+hh..} is synonymous with \x{hh..} in PCRE2 but is not supported in EBCDIC environments. Note that \N not followed by an opening curly bracket has a different meaning (see below). </P> -<br><a name="SEC4" href="#TOC1">CHARACTER TYPES</a><br> +<br><a name="SEC5" href="#TOC1">CHARACTER TYPES</a><br> <P> <pre> . any character except newline; @@ -136,14 +149,15 @@ or in the 16-bit and 32-bit libraries. However, if locale-specific matching is happening, \s and \w may also match characters with code points in the range 128-255. 
If the PCRE2_UCP option is set, the behaviour of these escape sequences is changed to use Unicode properties and they match many more -characters. +characters, but there are some option settings that can restrict individual +sequences to matching only ASCII characters. </P> <P> Property descriptions in \p and \P are matched caselessly; hyphens, underscores, and white space are ignored, in accordance with Unicode's "loose matching" rules. </P> -<br><a name="SEC5" href="#TOC1">GENERAL CATEGORY PROPERTIES FOR \p and \P</a><br> +<br><a name="SEC6" href="#TOC1">GENERAL CATEGORY PROPERTIES FOR \p and \P</a><br> <P> <pre> C Other @@ -193,20 +207,20 @@ matching" rules. Zs Space separator </PRE> </P> -<br><a name="SEC6" href="#TOC1">PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P</a><br> +<br><a name="SEC7" href="#TOC1">PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P</a><br> <P> <pre> Xan Alphanumeric: union of properties L and N Xps POSIX space: property Z or tab, NL, VT, FF, CR Xsp Perl space: property Z or tab, NL, VT, FF, CR - Xuc Univerally-named character: one that can be + Xuc Universally-named character: one that can be represented by a Universal Character Name Xwd Perl word: property Xan or underscore </pre> Perl and POSIX space are now the same. Perl added VT to its space character set at release 5.18. </P> -<br><a name="SEC7" href="#TOC1">BINARY PROPERTIES FOR \p AND \P</a><br> +<br><a name="SEC8" href="#TOC1">BINARY PROPERTIES FOR \p AND \P</a><br> <P> Unicode defines a number of binary properties, that is, properties whose only values are true or false. You can obtain a list of those that are recognized by @@ -215,7 +229,7 @@ values are true or false. 
You can obtain a list of those that are recognized by pcre2test -LP </PRE> </P> -<br><a name="SEC8" href="#TOC1">SCRIPT MATCHING WITH \p AND \P</a><br> +<br><a name="SEC9" href="#TOC1">SCRIPT MATCHING WITH \p AND \P</a><br> <P> Many script names and their 4-letter abbreviations are recognized in \p{sc:...} or \p{scx:...} items, or on their own with \p (and also \P of @@ -224,7 +238,7 @@ course). You can obtain a list of these scripts by running this command: pcre2test -LS </PRE> </P> -<br><a name="SEC9" href="#TOC1">THE BIDI_CLASS PROPERTY FOR \p AND \P</a><br> +<br><a name="SEC10" href="#TOC1">THE BIDI_CLASS PROPERTY FOR \p AND \P</a><br> <P> <pre> \p{Bidi_Class:<class>} matches a character with the given class @@ -257,7 +271,7 @@ The recognized classes are: WS which space </PRE> </P> -<br><a name="SEC10" href="#TOC1">CHARACTER CLASSES</a><br> +<br><a name="SEC11" href="#TOC1">CHARACTER CLASSES</a><br> <P> <pre> [...] positive character class @@ -285,7 +299,7 @@ In PCRE2, POSIX character set names recognize only ASCII characters by default, but some of them use Unicode properties if PCRE2_UCP is set. You can use \Q...\E inside a character class. </P> -<br><a name="SEC11" href="#TOC1">QUANTIFIERS</a><br> +<br><a name="SEC12" href="#TOC1">QUANTIFIERS</a><br> <P> <pre> ? 0 or 1, greedy @@ -304,9 +318,12 @@ but some of them use Unicode properties if PCRE2_UCP is set. You can use {n,} n or more, greedy {n,}+ n or more, possessive {n,}? n or more, lazy + {,m} zero up to m, greedy + {,m}+ zero up to m, possessive + {,m}? zero up to m, lazy </PRE> </P> -<br><a name="SEC12" href="#TOC1">ANCHORS AND SIMPLE ASSERTIONS</a><br> +<br><a name="SEC13" href="#TOC1">ANCHORS AND SIMPLE ASSERTIONS</a><br> <P> <pre> \b word boundary @@ -324,7 +341,7 @@ but some of them use Unicode properties if PCRE2_UCP is set. 
You can use \G first matching position in subject </PRE> </P> -<br><a name="SEC13" href="#TOC1">REPORTED MATCH POINT SETTING</a><br> +<br><a name="SEC14" href="#TOC1">REPORTED MATCH POINT SETTING</a><br> <P> <pre> \K set reported start of match @@ -334,13 +351,13 @@ for compatibility with Perl. However, if the PCRE2_EXTRA_ALLOW_LOOKAROUND_BSK option is set, the previous behaviour is re-enabled. When this option is set, \K is honoured in positive assertions, but ignored in negative ones. </P> -<br><a name="SEC14" href="#TOC1">ALTERNATION</a><br> +<br><a name="SEC15" href="#TOC1">ALTERNATION</a><br> <P> <pre> expr|expr|expr... </PRE> </P> -<br><a name="SEC15" href="#TOC1">CAPTURING</a><br> +<br><a name="SEC16" href="#TOC1">CAPTURING</a><br> <P> <pre> (...) capture group @@ -355,35 +372,47 @@ In non-UTF modes, names may contain underscores and ASCII letters and digits; in UTF modes, any Unicode letters and Unicode decimal digits are permitted. In both cases, a name must not start with a digit. </P> -<br><a name="SEC16" href="#TOC1">ATOMIC GROUPS</a><br> +<br><a name="SEC17" href="#TOC1">ATOMIC GROUPS</a><br> <P> <pre> (?>...) atomic non-capture group (*atomic:...) atomic non-capture group </PRE> </P> -<br><a name="SEC17" href="#TOC1">COMMENT</a><br> +<br><a name="SEC18" href="#TOC1">COMMENT</a><br> <P> <pre> (?#....) comment (not nestable) </PRE> </P> -<br><a name="SEC18" href="#TOC1">OPTION SETTING</a><br> +<br><a name="SEC19" href="#TOC1">OPTION SETTING</a><br> <P> Changes of these options within a group are automatically cancelled at the end of the group. 
<pre> + (?a) all ASCII options + (?aD) restrict \d to ASCII in UCP mode + (?aS) restrict \s to ASCII in UCP mode + (?aW) restrict \w to ASCII in UCP mode + (?aP) restrict all POSIX classes to ASCII in UCP mode + (?aT) restrict POSIX digit classes to ASCII in UCP mode (?i) caseless (?J) allow duplicate named groups (?m) multiline (?n) no auto capture + (?r) restrict caseless to either ASCII or non-ASCII (?s) single line (dotall) (?U) default ungreedy (lazy) - (?x) extended: ignore white space except in classes + (?x) ignore white space except in classes or \Q...\E (?xx) as (?x) but also ignore space and tab in classes - (?-...) unset option(s) - (?^) unset imnsx options + (?-...) unset the given option(s) + (?^) unset imnrsx options </pre> +(?aP) implies (?aT) as well, though this has no additional effect. However, it +means that (?-aP) is really (?-PT) which disables all ASCII restrictions for +POSIX classes. +</P> +<P> Unsetting x or xx unsets both. Several options may be set at once, and a mixture of setting and unsetting such as (?i-x) is allowed, but there may be only one hyphen. Setting (but no unsetting) is allowed after (?^ for example @@ -413,7 +442,7 @@ not increase them. LIMIT_RECURSION is an obsolete synonym for LIMIT_DEPTH. The application can lock out the use of (*UTF) and (*UCP) by setting the PCRE2_NEVER_UTF or PCRE2_NEVER_UCP options, respectively, at compile time. </P> -<br><a name="SEC19" href="#TOC1">NEWLINE CONVENTION</a><br> +<br><a name="SEC20" href="#TOC1">NEWLINE CONVENTION</a><br> <P> These are recognized only at the very start of the pattern or after option settings with a similar syntax. @@ -426,7 +455,7 @@ settings with a similar syntax. (*NUL) the NUL character (binary zero) </PRE> </P> -<br><a name="SEC20" href="#TOC1">WHAT \R MATCHES</a><br> +<br><a name="SEC21" href="#TOC1">WHAT \R MATCHES</a><br> <P> These are recognized only at the very start of the pattern or after option setting with a similar syntax. 
@@ -435,7 +464,7 @@ setting with a similar syntax. (*BSR_UNICODE) any Unicode newline sequence </PRE> </P> -<br><a name="SEC21" href="#TOC1">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a><br> +<br><a name="SEC22" href="#TOC1">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a><br> <P> <pre> (?=...) ) @@ -454,9 +483,14 @@ setting with a similar syntax. (*nlb:...) ) negative lookbehind (*negative_lookbehind:...) ) </pre> -Each top-level branch of a lookbehind must be of a fixed length. +Each top-level branch of a lookbehind must have a limit for the number of +characters it matches. If any branch can match a variable number of characters, +the maximum for each branch is limited to a value set by the caller of +<b>pcre2_compile()</b> or defaulted. The default is set when PCRE2 is built +(ultimate default 255). If every branch matches a fixed number of characters, +the limit for each branch is 65535 characters. </P> -<br><a name="SEC22" href="#TOC1">NON-ATOMIC LOOKAROUND ASSERTIONS</a><br> +<br><a name="SEC23" href="#TOC1">NON-ATOMIC LOOKAROUND ASSERTIONS</a><br> <P> These assertions are specific to PCRE2 and are not Perl-compatible. <pre> @@ -469,7 +503,7 @@ These assertions are specific to PCRE2 and are not Perl-compatible. (*non_atomic_positive_lookbehind:...) ) </PRE> </P> -<br><a name="SEC23" href="#TOC1">SCRIPT RUNS</a><br> +<br><a name="SEC24" href="#TOC1">SCRIPT RUNS</a><br> <P> <pre> (*script_run:...) ) script run, can be backtracked into @@ -479,7 +513,7 @@ These assertions are specific to PCRE2 and are not Perl-compatible. (*asr:...) ) </PRE> </P> -<br><a name="SEC24" href="#TOC1">BACKREFERENCES</a><br> +<br><a name="SEC25" href="#TOC1">BACKREFERENCES</a><br> <P> <pre> \n reference by number (can be ambiguous) @@ -496,7 +530,7 @@ These assertions are specific to PCRE2 and are not Perl-compatible. 
(?P=name) reference by name (Python) </PRE> </P> -<br><a name="SEC25" href="#TOC1">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a><br> +<br><a name="SEC26" href="#TOC1">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a><br> <P> <pre> (?R) recurse whole pattern @@ -515,15 +549,15 @@ These assertions are specific to PCRE2 and are not Perl-compatible. \g'-n' call subroutine by relative number (PCRE2 extension) </PRE> </P> -<br><a name="SEC26" href="#TOC1">CONDITIONAL PATTERNS</a><br> +<br><a name="SEC27" href="#TOC1">CONDITIONAL PATTERNS</a><br> <P> <pre> (?(condition)yes-pattern) (?(condition)yes-pattern|no-pattern) (?(n) absolute reference condition - (?(+n) relative reference condition - (?(-n) relative reference condition + (?(+n) relative reference condition (PCRE2 extension) + (?(-n) relative reference condition (PCRE2 extension) (?(<name>) named reference condition (Perl) (?('name') named reference condition (Perl) (?(name) named reference condition (PCRE2, deprecated) @@ -538,7 +572,7 @@ Note the ambiguity of (?(R) and (?(Rn) which might be named reference conditions or recursion tests. Such a condition is interpreted as a reference condition if the relevant named group exists. </P> -<br><a name="SEC27" href="#TOC1">BACKTRACKING CONTROL</a><br> +<br><a name="SEC28" href="#TOC1">BACKTRACKING CONTROL</a><br> <P> All backtracking control verbs may be in the form (*VERB:NAME). For (*MARK) the name is mandatory, for the others it is optional. (*SKIP) changes its behaviour @@ -565,7 +599,7 @@ pattern is not anchored. The effect of one of these verbs in a group called as a subroutine is confined to the subroutine call. </P> -<br><a name="SEC28" href="#TOC1">CALLOUTS</a><br> +<br><a name="SEC29" href="#TOC1">CALLOUTS</a><br> <P> <pre> (?C) callout (assumed number 0) @@ -576,12 +610,12 @@ The allowed string delimiters are ` ' " ^ % # $ (which are the same for the start and the end), and the starting delimiter { matched with the ending delimiter }. 
To encode the ending delimiter within the string, double it. </P> -<br><a name="SEC29" href="#TOC1">SEE ALSO</a><br> +<br><a name="SEC30" href="#TOC1">SEE ALSO</a><br> <P> <b>pcre2pattern</b>(3), <b>pcre2api</b>(3), <b>pcre2callout</b>(3), <b>pcre2matching</b>(3), <b>pcre2</b>(3). </P> -<br><a name="SEC30" href="#TOC1">AUTHOR</a><br> +<br><a name="SEC31" href="#TOC1">AUTHOR</a><br> <P> Philip Hazel <br> @@ -590,11 +624,11 @@ Retired from University Computing Service Cambridge, England. <br> </P> -<br><a name="SEC31" href="#TOC1">REVISION</a><br> +<br><a name="SEC32" href="#TOC1">REVISION</a><br> <P> -Last updated: 12 January 2022 +Last updated: 12 October 2023 <br> -Copyright © 1997-2022 University of Cambridge. +Copyright © 1997-2023 University of Cambridge. <br> <p> Return to the <a href="index.html">PCRE2 index page</a>. |
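The BRACED ITEMS and QUANTIFIERS hunks in this diff describe two related rules: spaces and horizontal tabs are ignored after {, before }, and around the comma of a quantifier, and {,m} means "zero up to m". The following is a minimal Python sketch of those two rules only. The helper name normalize_quantifier and its regular expression are hypothetical and written for illustration; this is not PCRE2's parser and it does not call the PCRE2 library. It also assumes that {} and {,} are not quantifiers, which the diff does not state.

```python
import re

# Hypothetical model of the brace-quantifier whitespace rule described in
# the diff. Spaces and tabs are tolerated after "{", before "}", and on
# either side of the comma.
_QUANT = re.compile(r"\{[ \t]*(\d*)[ \t]*(,?)[ \t]*(\d*)[ \t]*\}")


def normalize_quantifier(text):
    """Return (minimum, maximum) for a brace quantifier, or None.

    maximum is None for the open-ended form {n,}.
    None is returned when text is not a quantifier in this simplified model.
    """
    m = _QUANT.fullmatch(text)
    if m is None:
        return None
    lo, comma, hi = m.groups()
    if not comma:
        # Without a comma only the exact form {n} is valid; "{2 3}" is not.
        if not lo or hi:
            return None
        return (int(lo), int(lo))
    if not lo and not hi:
        # Assumption: "{,}" carries no bounds, so it is not a quantifier.
        return None
    minimum = int(lo) if lo else 0        # {,m} starts at zero
    maximum = int(hi) if hi else None     # {n,} has no upper bound
    return (minimum, maximum)


if __name__ == "__main__":
    for sample in ["{ 2 , 5 }", "{,3}", "{4,\t}", "{ 7 }", "{,}", "{2 3}"]:
        print(repr(sample), "->", normalize_quantifier(sample))
```

The sketch leaves out the \u{...} exception mentioned in the diff, because that depends on PCRE2_EXTRA_ALT_BSUX and on the text before the brace, which this helper never sees.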