BDE 4.14.0 Production release
Macros
#define PCRE2_CODE_UNIT_WIDTH 8
#define PCRE2_STATIC
Provide a mechanism for regular expression pattern matching.
This component provides a mechanism, bdlpcre::RegEx
, for compiling (or "preparing") regular expressions, and subsequently matching subject strings against a prepared expression and replacing the matching parts with the replacement string. The regular expressions supported by this component correspond approximately with Perl 5.10. See the appendix entitled "Perl Compatibility" below for more information.
Upon construction, a bdlpcre::RegEx
object is initially not associated with a regular expression. A regular expression pattern is compiled for use by the object using the prepare
method. Subject strings may then be matched against the prepared pattern using the set of overloaded match
methods.
The component provides the following groups of match
overloads (and similarly for matchRaw
):
1) match overloads that simply return 0 if a given subject string matches the prepared regular expression, and a non-zero value otherwise.
2) match overloads that return the substring of the subject that was matched, either as a bsl::string_view or as a bsl::pair<size_t, size_t> holding the (offset, length) pair.
3) match overloads that return a vector of either bsl::string_view or bsl::pair<size_t, size_t> holding the matched substrings. The first element of the vector indicates the substring of the subject that matched the entire pattern. Subsequent elements indicate the substrings of the subject that matched the respective sub-patterns.
The matched parts of subject strings can be replaced with the replacement string using the set of overloaded replace and replaceRaw methods.
A bdlpcre::RegEx
object must first be prepared with a valid regular expression before attempting to match subject strings or replace the matched parts. We say that an instance of bdlpcre::RegEx
is in the "prepared" state if the object holds a valid regular expression, in which case calls to the overloaded match
or replace
methods of that instance are valid. Otherwise, the object is in the "unprepared" state. Upon construction, a bdlpcre::RegEx
object is in the "unprepared" state. A successful call to the prepare
method puts the object into the "prepared" state. The clear
method, as well as an unsuccessful call to prepare
, puts the object into the "unprepared" state. The isPrepared
accessor may be used to determine whether an object is prepared.
A set of flags may be optionally supplied to the prepare
method to affect specific pattern matching behavior. The flags recognized by prepare
are defined in an enumeration declared within the bdlpcre::RegEx
class. The following describes these flags and their effects.
If RegEx::k_FLAG_CASELESS
is included in the flags supplied to prepare
, then letters in the regular expression pattern supplied to prepare
match both lower- and upper-case letters in subject strings subsequently supplied to match
. This is equivalent to Perl's /i
option, and can be changed within a pattern by a (?i)
option setting.
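The effect of caseless matching can be illustrated with a minimal sketch. It uses the C++ standard library's `std::regex` rather than `bdlpcre::RegEx`, so `std::regex::icase` merely stands in for `k_FLAG_CASELESS`; the helper names `containsCaseless` and `containsExact` are hypothetical and not part of this component.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not part of 'bdlpcre'): return 'true' if 'subject'
// contains a match for 'pattern' when letter case is ignored.  Here
// 'std::regex::icase' plays the role that 'k_FLAG_CASELESS' plays for
// 'prepare'.
bool containsCaseless(const std::string& pattern, const std::string& subject)
{
    const std::regex re(pattern, std::regex::ECMAScript | std::regex::icase);
    return std::regex_search(subject, re);
}

// The same search with case sensitivity left on, for comparison.
bool containsExact(const std::string& pattern, const std::string& subject)
{
    const std::regex re(pattern);
    return std::regex_search(subject, re);
}
```

With the pattern "subject:" and the subject "SUBJECT: hello", only the caseless variant reports a match.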
By default, a subject string supplied to match
or replace
is treated as consisting of a single line of characters (even if it actually contains '\n' characters). The start-of-line meta-character ^
matches only at the beginning of the string, and the end-of-line meta-character $
matches only at the end of the string (or before a terminating '\n', if present). This matches the behavior of Perl.
If RegEx::k_FLAG_MULTILINE
is included in the flags supplied to prepare
, then start-of-line and end-of-line meta-characters match immediately following or immediately before any '\n' characters in subject strings supplied to match
, respectively (as well as at the very start and end of subject strings). This is equivalent to Perl's /m
option, and can be changed within a pattern by a (?m)
option setting. If there are no '\n' characters in the subject string, or if there are no occurrences of ^
or $
in the prepared pattern, then including k_FLAG_MULTILINE
has no effect.
If RegEx::k_FLAG_UTF8
is included in the flags supplied to prepare
, then the regular expression pattern supplied to prepare
, the subject strings subsequently supplied to match
, matchRaw
, replace
, and replaceRaw
as well as the replacement string supplied to replace
and replaceRaw
are interpreted as strings of UTF-8 characters instead of strings of ASCII characters. match
and replace
return a non-zero value if pattern()
was prepared with k_FLAG_UTF8
, but the subject or the replacement is not a valid UTF-8 string. The behavior of matchRaw
is undefined if pattern()
was prepared with k_FLAG_UTF8
, but the subject is not a valid UTF-8 string. Note that JIT optimization (see below) is disabled for match
if pattern()
was prepared with k_FLAG_UTF8
.
If RegEx::k_FLAG_DOTMATCHESALL
is included in the flags supplied to prepare
, then a dot metacharacter in the pattern matches a character of any value, including one that indicates a newline. However, it only ever matches one character, even if newlines are encoded as '\r\n'. If k_FLAG_DOTMATCHESALL
is not used to prepare a regular expression, a dot metacharacter will not match a newline; hence, patterns expected to match across lines will fail to do so. This flag is equivalent to Perl's /s
option, and can be changed within a pattern by a (?s)
option setting. A negative class such as [^a]
always matches newline characters, independent of the setting of this option.
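The difference between a dot and a negative class can be seen in a small sketch. It relies on `std::regex` in its default ECMAScript mode, where a dot likewise does not match a line terminator; that mode is only an analogue of preparing a pattern without `k_FLAG_DOTMATCHESALL`, and the helper name `matchesWhole` is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not part of 'bdlpcre'): return 'true' if the whole
// of 'subject' matches 'pattern' under the default 'std::regex'
// (ECMAScript) rules, in which '.' does not match a line terminator.
bool matchesWhole(const std::string& pattern, const std::string& subject)
{
    return std::regex_match(subject, std::regex(pattern));
}
```

Under these rules "a.b" does not match the subject "a\nb", whereas "a[^x]b" does, because the negative class accepts the newline.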
If RegEx::k_FLAG_DUPNAMES
is included in the flags supplied to prepare
, then sub-pattern names can be used more than once. Alternatively this feature can be turned on within a pattern by a (?J)
option setting (see https://www.pcre.org/current/doc/html/pcre2syntax.html#SEC16). The subpatternIndex(name)
call will fail if name
is used more than once - in that case, the namedSubpatterns()
call should be used. namedSubpatterns()
returns a set of (name, index) pairs used in the pattern.
A new string can be created by applying the regular expression pattern to the subject string in which the matching parts are replaced with the replacement string supplied to the replace
and replaceRaw
methods.
By default, a dollar character ($
) is an escape character that can specify the insertion of characters from capture groups and names from (*MARK)
or other control verbs in the pattern (see https://perldoc.perl.org/perlre#Special-Backtracking-Control-Verbs for details). The following forms are always recognized:
Either a group number or a group name can be given for <n>
. Curly braces are required only if the following character would be interpreted as part of the number or name. The number may be zero to include the entire matched string. For example, if the pattern a(b)c
is matched with =abc=
and the replacement string +$1$0$1+
, the result is =+babcb+=
.
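The expansion in this example can be traced with a minimal, self-contained sketch. It does not call PCRE2 or `bdlpcre::RegEx`; it assumes the captured groups are already available as strings (index 0 being the entire match) and handles only `$$`, `$<n>`, and `${<n>}` with numeric `<n>`. The names `expandReplacement` and `spliceMatch` are hypothetical, and anything the sketch does not understand is copied literally, whereas the real library may report an error.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: expand '$$', '$<n>', and '${<n>}' (numeric group
// references only) in 'replacement', where 'groups[0]' is the entire match
// and 'groups[n]' is capture group 'n'.
std::string expandReplacement(const std::string&              replacement,
                              const std::vector<std::string>& groups)
{
    std::string       out;
    const std::size_t n = replacement.size();
    for (std::size_t i = 0; i < n; ++i) {
        if (replacement[i] != '$' || i + 1 == n) {
            out += replacement[i];      // ordinary character or trailing '$'
            continue;
        }
        if (replacement[i + 1] == '$') {
            out += '$';                 // '$$' inserts one dollar character
            ++i;
            continue;
        }
        const bool  braced = replacement[i + 1] == '{';
        std::size_t j      = i + (braced ? 2 : 1);
        std::size_t number = 0;
        std::size_t digits = 0;
        while (j < n
            && std::isdigit(static_cast<unsigned char>(replacement[j]))) {
            number = number * 10
                   + static_cast<std::size_t>(replacement[j] - '0');
            ++j;
            ++digits;
        }
        const bool closed = !braced || (j < n && replacement[j] == '}');
        if (digits == 0 || !closed || number >= groups.size()) {
            out += '$';                 // unsupported form: copy literally
            continue;
        }
        out += groups[number];
        i = braced ? j : j - 1;         // resume after the reference
    }
    return out;
}

// Hypothetical helper: replace the 'length' characters of 'subject' that
// start at 'offset' with 'expansion'.
std::string spliceMatch(const std::string& subject,
                        std::size_t        offset,
                        std::size_t        length,
                        const std::string& expansion)
{
    assert(offset + length <= subject.size());
    return subject.substr(0, offset) + expansion
         + subject.substr(offset + length);
}
```

For the pattern `a(b)c` and the subject `=abc=`, the match is at offset 1 with length 3 and the groups are "abc" and "b"; expanding `+$1$0$1+` gives `+babcb+`, and splicing it into the subject gives `=+babcb+=`.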
A set of flags may be optionally supplied to the replace
and replaceRaw
methods to affect specific substitution behavior. The flags recognized by replace
and replaceRaw
are defined in an enumeration declared within the bdlpcre::RegEx
class. The flags are combined with bitwise OR and passed in the options
argument to replace
and replaceRaw
(e.g., k_REPLACE_GLOBAL | k_REPLACE_LITERAL). The flags reflect the PCRE2_SUBSTITUTE_*
flags and are propagated to the underlying PCRE2 library substitute function. See https://www.pcre.org/current/doc/html/pcre2api.html#SEC36 for details. The following describes these flags and their effects.
The default action of replace
and replaceRaw
is to perform just one replacement if the pattern matches. The RegEx::k_REPLACE_GLOBAL
flag requests multiple replacements in the subject string.
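The contrast between a single replacement and a global one can be sketched with `std::regex_replace` from the C++ standard library. Note that the defaults are inverted relative to this component: `std::regex_replace` replaces every match unless `format_first_only` is given, whereas `replace` performs one replacement unless `k_REPLACE_GLOBAL` is given. The helper names `replaceOnce` and `replaceEvery` are hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: replace only the first match of 'pattern' in
// 'subject', analogous to calling 'replace' without 'k_REPLACE_GLOBAL'.
std::string replaceOnce(const std::string& subject,
                        const std::string& pattern,
                        const std::string& replacement)
{
    return std::regex_replace(subject,
                              std::regex(pattern),
                              replacement,
                              std::regex_constants::format_first_only);
}

// Hypothetical helper: replace every match of 'pattern' in 'subject',
// analogous to calling 'replace' with 'k_REPLACE_GLOBAL'.
std::string replaceEvery(const std::string& subject,
                         const std::string& pattern,
                         const std::string& replacement)
{
    return std::regex_replace(subject, std::regex(pattern), replacement);
}
```

For the subject "a-b-c", the pattern "-", and the replacement "+", the first helper yields "a+b-c" and the second yields "a+b+c".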
If RegEx::k_REPLACE_LITERAL
is set, the replacement string is not interpreted in any way.
If RegEx::k_REPLACE_EXTENDED
is set, extra processing is applied to the replacement string. Without this option, only the dollar character ($
) is special, and only the group insertion forms listed above (see {Group Insertion Forms}) are valid. When this flag is set, two things change:
<n>
may be a group number or a name. The first form specifies a default value. If group <n>
is set, its value is inserted; if not, <string>
is expanded and the result inserted. The second form specifies strings that are expanded and inserted when group <n>
is set or unset, respectively. The first form is just a convenient shorthand for ${<n>:+${<n>}:<string>}
.
The RegEx::k_REPLACE_UNKNOWN_UNSET
flag causes references to capture groups that do not appear in the pattern to be treated as unset groups.
The RegEx::k_REPLACE_UNSET_EMPTY
flag causes unset capture groups (including unknown groups when RegEx::k_REPLACE_UNKNOWN_UNSET
is set) to be treated as empty strings when inserted as described in {Group Insertion Forms}. If this option is not set, an attempt to insert an unset group causes replace
and replaceRaw
to return an error. This option does not influence the extended substitution syntax described in {Extended Replacement Processing}.
Just-in-time compiling is a heavyweight optimization that can greatly speed up pattern matching on supported platforms. However, it comes at the cost of extra processing before the match is performed, so it is of most benefit when the same pattern is going to be matched many times. This does not necessarily mean many calls of a matching function; if the pattern is not anchored, matching attempts may take place many times at various positions in the subject, even for a single call. Therefore, if the subject string is very long, it may still pay to use JIT even for one-off matches.
If RegEx::k_FLAG_JIT
is included in the flags supplied to prepare
, then all following matches performed by matchRaw
will be JIT optimized. Matches performed by match
will also be JIT optimized provided that RegEx::k_FLAG_UTF8
was not supplied to prepare
(since UTF-8 string validity checking is not done during JIT compilation). To disable JIT optimization for all matches, prepare the regular expression again omitting the k_FLAG_JIT
flag.
JIT is supported on the following platforms:
The tables below demonstrate the benefit of the match
method with JIT optimizations, as well as the increased cost for prepare
when enabling JIT optimizations:
In this first table, for each pattern, prepare
was called once, and match was called 100000 times (measurements are in seconds):
In this second table, for each pattern, we measured 10000 iterations, where prepare
was called once, and match
was called once (measurements are in seconds):
Note that the tests were run on Linux / Intel Xeon CPU (3.47GHz, 64-bit), compiled with gcc-4.8.2 in optimized mode.
bdlpcre::RegEx
is const thread-safe, meaning that accessors may be invoked concurrently from different threads, but it is not safe to access or modify a bdlpcre::RegEx
in one thread while another thread modifies the same object. Specifically, the match
method can be called from multiple threads after the pattern has been prepared.
Note that bdlpcre::RegEx
incurs some overhead in order to provide thread-safe pattern matching functionality. To perform the pattern match, the underlying PCRE2 library requires a set of buffers that cannot be shared between threads.
The table below demonstrates the difference between invoking the match
method from the main thread (the thread that invokes prepare
) and from other threads:
Note that the JIT stack is functionally part of the match context. Using a large JIT stack can incur an additional performance penalty in multi-threaded applications.
The PCRE2 library supports memory allocation/deallocation functions supplied by the client. bdlpcre_regex provides wrappers around bslma
allocators that are called from the context of the PCRE2 library (C linkage). Any exceptions thrown during memory allocation are caught by the wrapper functions and are not propagated to the PCRE2 library.
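The general shape of such wrappers can be sketched as follows. This is not the component's actual implementation: the `Allocator` protocol and the two toy allocators are hypothetical stand-ins for `bslma`, and the callbacks are only assumed to follow the "size (or address) plus opaque context pointer" form that PCRE2 accepts for client-supplied memory functions. The point illustrated is that an exception thrown by the allocator is caught inside the C-linkage function and reported as a null pointer instead of unwinding through C code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

// Hypothetical stand-in for an allocator protocol in the spirit of 'bslma';
// the real 'bslma' interface is not reproduced here.
class Allocator {
  public:
    virtual ~Allocator() {}
    virtual void *allocate(std::size_t size) = 0;   // may throw
    virtual void  deallocate(void *address) = 0;
};

// Toy allocator backed by 'malloc' that throws when no memory is returned.
class MallocAllocator : public Allocator {
  public:
    void *allocate(std::size_t size)
    {
        void *p = std::malloc(size);
        if (!p) {
            throw std::bad_alloc();
        }
        return p;
    }
    void deallocate(void *address) { std::free(address); }
};

// Toy allocator that always fails, used to exercise the error path.
class ThrowingAllocator : public Allocator {
  public:
    void *allocate(std::size_t) { throw std::bad_alloc(); }
    void  deallocate(void *) {}
};

extern "C" {

// Allocate 'size' bytes from the allocator passed in 'context'.  Any
// exception is swallowed here and reported as a null pointer.
void *allocateCallback(std::size_t size, void *context)
{
    assert(context);
    try {
        return static_cast<Allocator *>(context)->allocate(size);
    }
    catch (...) {
        return 0;
    }
}

// Return 'address' to the allocator passed in 'context'.
void deallocateCallback(void *address, void *context)
{
    assert(context);
    if (address) {
        static_cast<Allocator *>(context)->deallocate(address);
    }
}

}  // close extern "C"
```

A C library handed these two functions together with a pointer to an allocator sees an ordinary allocation failure when the allocator throws.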
The following snippets of code illustrate using this component to extract the text of the "Subject:" field from an Internet e-mail message (RFC822). The following parseSubject
function accepts an RFC822-compliant message of a specified length and returns the text of the message's subject in the result
"out" parameter:
The following is the regular expression that will be used to find the subject text of message
. The "?P<subjectText>" syntax, borrowed from Python, allows us later to refer to a particular matched sub-pattern (i.e., the text between the :
and the '\r' in the "Subject:" field of the header) by the name "subjectText":
First we compile the PATTERN
, using the prepare
method, in order to match subject strings against it. In the event that prepare
fails, the first two arguments will be loaded with diagnostic information (an informational string and an index into the pattern at which the error occurred, respectively). Two flags, RegEx::k_FLAG_CASELESS
and RegEx::k_FLAG_MULTILINE
, are used in preparing the pattern since Internet message headers contain case-insensitive content as well as '\n' characters. The prepare
method returns 0 on success, and a non-zero value otherwise:
Next we call match
supplying message
and its length. The matchVector
will be populated with (offset, length) pairs describing substrings in message
that match the prepared PATTERN
. All variants of the overloaded match
method return the k_STATUS_SUCCESS
status if a match is found, k_STATUS_NO_MATCH
if a match is not found, and some other value if any error occurs. This value may help us to understand the reason for the failure:
Then we pass "subjectText" to the subpatternIndex
method to obtain the index into matchVector
that describes how to locate the subject text within message
. The text is then extracted from message
and assigned to the result
"out" parameter:
The following array contains the sample Internet e-mail message from which we will extract the subject:
Finally, we call parseSubject
to extract the subject from RFC822_MESSAGE
. The assertions verify that the subject of the message is correctly extracted and assigned to the local subject
variable:
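The steps of this walk-through can be approximated with a self-contained sketch built on `std::regex` rather than `bdlpcre::RegEx`. Several substitutions are assumptions made for illustration: `std::regex` has no named sub-patterns, so group 2 stands in for "subjectText"; the alternation `(^|\n)` stands in for `k_FLAG_MULTILINE`; and `std::regex::icase` stands in for `k_FLAG_CASELESS`. The function name `parseSubjectSketch` and the sample message used with it are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <utility>

// Hypothetical analogue of 'parseSubject': load into 'result' the text that
// follows "Subject:" (up to, but not including, the next '\r') in
// 'message'.  Return 0 on success, and a non-zero value if no subject field
// is found.
int parseSubjectSketch(std::string *result, const std::string& message)
{
    assert(result);

    // Group 1 anchors the field at the start of the message or just after
    // a '\n'; group 2 captures the subject text.
    static const std::regex re("(^|\n)subject:([^\r]*)",
                               std::regex::ECMAScript | std::regex::icase);

    std::smatch match;
    if (!std::regex_search(message, match, re)) {
        return -1;
    }

    // Describe the capture as an (offset, length) pair, then extract it.
    const std::pair<std::size_t, std::size_t> capture(
                               static_cast<std::size_t>(match.position(2)),
                               static_cast<std::size_t>(match.length(2)));
    *result = message.substr(capture.first, capture.second);
    return 0;
}
```

Given the message "Received: by host\r\nSubject: Hello World\r\nTo: me\r\n", the function returns 0 and loads " Hello World" (including the leading space) into the result.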
This section describes the differences in the ways that PCRE2 and Perl handle regular expressions. The differences described here are with respect to Perl versions 5.10 and above.
1) PCRE2 has only a subset of Perl's Unicode support.
2) PCRE2 allows repeat quantifiers only on parenthesized assertions, but they do not mean what you might think. For example, (?!a){3}
does not assert that the next three characters are not "a"
. It just asserts that the next character is not "a"
three times (in principle: PCRE2 optimizes this to run the assertion just once). Perl allows repeat quantifiers on other assertions such as '\b', but these do not seem to have any use.
3) Capturing subpatterns that occur inside negative lookahead assertions are counted, but their entries in the offsets vector are never set. Perl sometimes (but not always) sets its numerical variables from inside negative assertions.
4) The following Perl escape sequences are not supported: '\l', '\u', '\L', '\U', and '\N' when followed by a character name or Unicode value. ('\N' on its own, matching a non-newline character, is supported.) In fact these are implemented by Perl's general string-handling and are not part of its pattern matching engine. If any of these are encountered by PCRE2, an error is generated by default.
5) The Perl escape sequences '\p', '\P', and '\X' are supported only if PCRE2 is built with Unicode support. The properties that can be tested with '\p' and '\P' are limited to the general category properties such as
Lu
and Nd
, script names such as Greek or Han, and the derived properties Any
and L&
. PCRE2 does support the Cs
(surrogate) property, which Perl does not; the Perl documentation says "Because Perl hides the need for
the user to understand the internal representation of Unicode characters,
there is no need to implement the somewhat messy concept of surrogates."
6) PCRE2 does support the '\Q...\E' escape for quoting substrings. Characters in between are treated as literals. This is slightly different from Perl in that $
and @
are also handled as literals inside the quotes. In Perl, they cause variable interpolation (but of course PCRE2 does not have variables). Note the following examples:
The '\Q...\E' sequence is recognized both inside and outside character classes.
7) PCRE2 does not support the (?{code})
and (??{code})
constructions. However, there is support for recursive patterns. This is not available in Perl 5.8, but it is in Perl 5.10.
8) Subroutine calls (whether recursive or not) are treated as atomic groups. Atomic recursion is like Python, but unlike Perl. Captured values that are set outside a subroutine call can be referenced from inside in PCRE2, but not in Perl.
9) If any of the backtracking control verbs are used in a subpattern that is called as a subroutine (whether or not recursively), their effect is confined to that subpattern; it does not extend to the surrounding pattern. This is not always the case in Perl. In particular, if (*THEN)
is present in a group that is called as a subroutine, its action is limited to that group, even if the group does not contain any |
characters. Note that such subpatterns are processed as anchored at the point where they are tested.
10) If a pattern contains more than one backtracking control verb, the first one that is backtracked onto acts. For example, in the pattern A(*COMMIT)B(*PRUNE)C
a failure in B
triggers (*COMMIT),
but a failure in C
triggers (*PRUNE)
. Perl's behaviour is more complex; in many cases it is the same as PCRE2, but there are examples where it differs.
11) Most backtracking verbs in assertions have their normal actions. They are not confined to the assertion.
12) There are some differences that are concerned with the settings of captured strings when part of a pattern is repeated. For example, matching "aba"
against the pattern /^(a(b)?)+$/
in Perl leaves $2
unset, but in PCRE2 it is set to "b"
.
13) PCRE2's handling of duplicate subpattern numbers and duplicate subpattern names is not as general as Perl's. This is a consequence of the fact that PCRE2 works internally just with numbers, using an external table to translate between numbers and names. In particular, a pattern such as (?|(?<a>A)|(?<b>B))
, where the two capturing parentheses have the same number but different names, is not supported, and causes an error at compile time. If it were allowed, it would not be possible to distinguish which parentheses matched, because both names map to capturing subpattern number 1.
14) Perl recognizes comments in some places that PCRE2 does not, for example, between the (
and ?
at the start of a subpattern. If the /x
modifier is set, Perl allows white space between (
and ?
(though current Perls warn that this is deprecated) but PCRE2 never does, even if the PCRE2_EXTENDED
option is set.
15) Perl, when in warning mode, gives warnings for character classes such as [A-\d]
or [a-[:digit:]]
. It then treats the hyphens as literals. PCRE2 has no warning features, so it gives an error in these cases because they are almost certainly user mistakes.
16) In PCRE2, the upper/lower case character properties Lu
and Ll
are not affected when case-independent matching is specified. For example, '\p{Lu}' always matches an upper case letter.
17) PCRE2 provides some extensions to the Perl regular expression facilities. This list is with respect to Perl 5.10:
(a) Although lookbehind assertions in PCRE2 must match fixed length strings, each alternative branch of a lookbehind assertion can match a different length of string. Perl requires them all to have the same length.
(b) If PCRE2_DOLLAR_ENDONLY
is set and PCRE2_MULTILINE
is not set, the $
meta-character matches only at the very end of the string.
(c) A backslash followed by a letter with no special meaning is faulted. (Perl can be made to issue a warning.)
(d) If PCRE2_UNGREEDY
is set, the greediness of the repetition quantifiers is inverted, that is, by default they are not greedy, but if followed by a question mark they are.
(e) PCRE2_ANCHORED
can be used at matching time to force a pattern to be tried only at the first matching position in the subject string.
(f) The PCRE2_NOTBOL
, PCRE2_NOTEOL
, PCRE2_NOTEMPTY
, PCRE2_NOTEMPTY_ATSTART
, and PCRE2_NO_AUTO_CAPTURE
options have no Perl equivalents.
(g) The '\R' escape sequence can be restricted to match only CR,
LF,
or CRLF
by the PCRE2_BSR_ANYCRLF
option.
(h) The callout facility is PCRE2-specific.
(i) The partial matching facility is PCRE2-specific.
(j) The alternative matching function (pcre2_dfa_match()
) matches in a different way and is not Perl-compatible.
(k) PCRE2 recognizes some special sequences such as (*CR)
at the start of a pattern that set overall options that cannot be changed within the pattern.