BDE 4.14.0 Production release
bdlpcre_regex.h
1/// @file bdlpcre_regex.h
2///
3/// The content of this file has been pre-processed for Doxygen.
4///
5
6
7// bdlpcre_regex.h -*-C++-*-
8#ifndef INCLUDED_BDLPCRE_REGEX
9#define INCLUDED_BDLPCRE_REGEX
10
11#include <bsls_ident.h>
12BSLS_IDENT("$Id$ $CSID$")
13
14/// @defgroup bdlpcre_regex bdlpcre_regex
15/// @brief Provide a mechanism for regular expression pattern matching.
16/// @addtogroup bdl
17/// @{
18/// @addtogroup bdlpcre
19/// @{
20/// @addtogroup bdlpcre_regex
21/// @{
22///
23/// <h1> Outline </h1>
24/// * <a href="#bdlpcre_regex-purpose"> Purpose</a>
25/// * <a href="#bdlpcre_regex-classes"> Classes </a>
26/// * <a href="#bdlpcre_regex-description"> Description </a>
27/// * <a href="#bdlpcre_regex-prepared-state"> "Prepared" State </a>
28/// * <a href="#bdlpcre_regex-prepare-time-flags"> Prepare-Time Flags </a>
29/// * <a href="#bdlpcre_regex-case-insensitive-matching"> Case-Insensitive Matching </a>
30/// * <a href="#bdlpcre_regex-multi-line-matching"> Multi-Line Matching </a>
31/// * <a href="#bdlpcre_regex-utf-8-support"> UTF-8 Support </a>
32/// * <a href="#bdlpcre_regex-dot-matches-all"> Dot Matches All </a>
33/// * <a href="#bdlpcre_regex-allow-duplicate-named-groups"> Allow Duplicate Named Groups (sub-patterns) </a>
34/// * <a href="#bdlpcre_regex-creating-a-new-string-with-replacement"> Creating a New String with Replacement </a>
35/// * <a href="#bdlpcre_regex-group-insertion-forms"> Group Insertion Forms </a>
36/// * <a href="#bdlpcre_regex-replacement-flags"> Replacement Flags </a>
37/// * <a href="#bdlpcre_regex-global-replacement"> Global Replacement </a>
38/// * <a href="#bdlpcre_regex-the-replacement-string-is-literal"> The Replacement String is Literal </a>
39/// * <a href="#bdlpcre_regex-extended-replacement-processing"> Extended Replacement Processing </a>
40/// * <a href="#bdlpcre_regex-treat-unknown-group-as-unset"> Treat Unknown Group As Unset </a>
41/// * <a href="#bdlpcre_regex-insert-an-empty-string-for-unset-group"> Insert An Empty String For Unset Group </a>
42/// * <a href="#bdlpcre_regex-jit-compiling-optimization"> JIT Compiling Optimization </a>
43/// * <a href="#bdlpcre_regex-thread-safety"> Thread Safety </a>
44/// * <a href="#bdlpcre_regex-note-on-memory-allocation-exceptions"> Note on Memory Allocation Exceptions </a>
45/// * <a href="#bdlpcre_regex-usage"> Usage </a>
46/// * <a href="#bdlpcre_regex-appendix-perl-compatibility"> Appendix: Perl Compatibility </a>
47/// * <a href="#bdlpcre_regex-additional-copyright-notice"> Additional Copyright Notice </a>
48///
49/// # Purpose {#bdlpcre_regex-purpose}
50/// Provide a mechanism for regular expression pattern matching.
51///
52/// # Classes {#bdlpcre_regex-classes}
53///
54/// - bdlpcre::RegEx: mechanism for compiling and matching regular expressions
55///
56/// @see http://www.pcre.org/
57///
58/// # Description {#bdlpcre_regex-description}
59/// This component provides a mechanism, `bdlpcre::RegEx`, for
60/// compiling (or "preparing") regular expressions, and subsequently matching
61/// subject strings against a prepared expression and replacing the matching
62/// parts with the replacement string. The regular expressions supported by
63/// this component correspond approximately with Perl 5.10. See the appendix
64/// entitled "Perl Compatibility" below for more information.
65///
66/// Upon construction, a `bdlpcre::RegEx` object is initially not associated
67/// with a regular expression. A regular expression pattern is compiled for use
68/// by the object using the `prepare` method. Subject strings may then be
69/// matched against the prepared pattern using the set of overloaded `match`
70/// methods.
71///
72/// The component provides the following groups of `match` overloads (and
73/// similarly for `matchRaw`):
74///
75/// 1. The first group of `match` overloads simply returns 0 if a given subject
76/// string matches the prepared regular expression, and returns a non-zero
77/// value otherwise.
78/// 2. The second group of `match` overloads returns the substring of the
79/// subject that was matched, either as a `bsl::string_view`, or as a
80/// `bsl::pair<size_t, size_t>` holding the (offset, length) pair.
81/// 3. The third group of `match` overloads returns a vector of either
82/// `bsl::string_view` or `bsl::pair<size_t, size_t>` holding the matched
83/// substrings. The first element of the vector indicates the substring of
84/// the subject that matched the entire pattern. Subsequent elements
85/// indicate the substrings of the subject that matched respective
86/// sub-patterns.
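///
/// As a rough illustration of the second and third groups, an (offset, length)
/// pair can be turned back into a view of the subject. The sketch below uses
/// only the standard library (`std::string_view` and `std::pair` stand in for
/// the `bsl` types), and the helper name `viewOf` is hypothetical; it is not
/// part of this component.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <utility>

// Hypothetical helper (not part of 'bdlpcre'): return the part of the
// specified 'subject' described by the specified (offset, length) 'match'
// pair.  The behavior is undefined unless
// 'match.first + match.second <= subject.size()'.
inline std::string_view viewOf(
                       std::string_view                           subject,
                       const std::pair<std::size_t, std::size_t>& match)
{
    assert(match.first + match.second <= subject.size());
    return subject.substr(match.first, match.second);
}
```

/// Under this model, the first element of a vector of pairs would be passed to
/// `viewOf` to recover the text that matched the entire pattern.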
87///
88/// The matched parts of subject strings can be replaced with the replacement
89/// string using the set of overloaded `replace` and `replaceRaw` methods.
90///
91/// ## "Prepared" State {#bdlpcre_regex-prepared-state}
92///
93///
94/// A `bdlpcre::RegEx` object must first be prepared with a valid regular
95/// expression before attempting to match subject strings or replace the matched
96/// parts. We say that an instance of `bdlpcre::RegEx` is in the "prepared"
97/// state if the object holds a valid regular expression, in which case calls to
98/// the overloaded `match` or `replace` methods of that instance are valid.
99/// Otherwise, the object is in the "unprepared" state. Upon construction, a
100/// `bdlpcre::RegEx` object is in the "unprepared" state. A successful call to
101/// the `prepare` method puts the object into the "prepared" state. The `clear`
102/// method, as well as an unsuccessful call to `prepare`, puts the object into
103/// the "unprepared" state. The `isPrepared` accessor may be used to determine
104/// whether an object is prepared.
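///
/// The state transitions described above can be modelled with a small toy
/// class. This is only a sketch: `ToyRegEx` is a hypothetical name, it wraps
/// `std::regex` rather than PCRE2, and its `prepare` takes just a pattern
/// (the real method also reports an error message and offset).

```cpp
#include <cassert>
#include <optional>
#include <regex>
#include <string>

// Toy stand-in (not 'bdlpcre::RegEx'): tracks the "prepared" state around a
// 'std::regex'.
class ToyRegEx {
    std::optional<std::regex> d_regex;  // engaged only while "prepared"

  public:
    // Compile the specified 'pattern'.  Return 0 on success; otherwise
    // return a non-zero value and leave this object "unprepared".
    int prepare(const std::string& pattern)
    {
        d_regex.reset();
        try {
            d_regex.emplace(pattern);
        }
        catch (const std::regex_error&) {
            return -1;                                                // RETURN
        }
        return 0;
    }

    // Put this object into the "unprepared" state.
    void clear() { d_regex.reset(); }

    // Return 'true' if this object holds a valid regular expression.
    bool isPrepared() const { return d_regex.has_value(); }
};
```

/// In this model, as in the text above, both `clear` and a failed `prepare`
/// leave the object unprepared, even if it was prepared beforehand.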
105///
106/// ## Prepare-Time Flags {#bdlpcre_regex-prepare-time-flags}
107///
108///
109/// A set of flags may be optionally supplied to the `prepare` method to affect
110/// specific pattern matching behavior. The flags recognized by `prepare` are
111/// defined in an enumeration declared within `bdlpcre::RegEx`. The
112/// following describes these flags and their effects.
113///
114/// ### Case-Insensitive Matching {#bdlpcre_regex-case-insensitive-matching}
115///
116///
117/// If `RegEx::k_FLAG_CASELESS` is included in the flags supplied to `prepare`,
118/// then letters in the regular expression pattern supplied to `prepare` match
119/// both lower- and upper-case letters in subject strings subsequently supplied
120/// to `match`. This is equivalent to Perl's `/i` option, and can be changed
121/// within a pattern by a `(?i)` option setting.
122///
123/// ### Multi-Line Matching {#bdlpcre_regex-multi-line-matching}
124///
125///
126/// By default, a subject string supplied to `match` or `replace` is treated as
127/// consisting of a single line of characters (even if it actually contains '\n'
128/// characters). The start-of-line meta-character `^` matches only at the
129/// beginning of the string, and the end-of-line meta-character `$` matches only
130/// at the end of the string (or before a terminating '\n', if present). This
131/// matches the behavior of Perl.
132///
133/// If `RegEx::k_FLAG_MULTILINE` is included in the flags supplied to `prepare`,
134/// then start-of-line and end-of-line meta-characters match immediately
135/// following or immediately before any '\n' characters in subject strings
136/// supplied to `match`, respectively (as well as at the very start and end of
137/// subject strings). This is equivalent to Perl's `/m` option, and can be
138/// changed within a pattern by a `(?m)` option setting. If there are no
139/// '\n' characters in the subject string, or if there are no occurrences of `^`
140/// or `$` in the prepared pattern, then including `k_FLAG_MULTILINE` has no
141/// effect.
142///
143/// ### UTF-8 Support {#bdlpcre_regex-utf-8-support}
144///
145///
146/// If `RegEx::k_FLAG_UTF8` is included in the flags supplied to `prepare`, then
147/// the regular expression pattern supplied to `prepare`, the subject strings
148/// subsequently supplied to `match`, `matchRaw`, `replace`, and `replaceRaw` as
149/// well as the replacement string supplied to `replace` and `replaceRaw` are
150/// interpreted as strings of UTF-8 characters instead of strings of ASCII
151/// characters. `match` and `replace` return a non-zero value if `pattern()`
152/// was prepared with `k_FLAG_UTF8`, but the subject or the replacement is not
153/// a valid UTF-8 string. The behavior of `matchRaw` is undefined if
154/// `pattern()` was prepared with `k_FLAG_UTF8`, but the subject is not a valid
155/// UTF-8 string. Note that JIT optimization (see below) is disabled for
156/// `match` if `pattern()` was prepared with `k_FLAG_UTF8`.
157///
158/// ### Dot Matches All {#bdlpcre_regex-dot-matches-all}
159///
160///
161/// If `RegEx::k_FLAG_DOTMATCHESALL` is included in the flags supplied to
162/// `prepare`, then a dot metacharacter in the pattern matches a character of
163/// any value, including one that indicates a newline. However, it only ever
164/// matches one character, even if newlines are encoded as '\r\n'. If
165/// `k_FLAG_DOTMATCHESALL` is not used to prepare a regular expression, a dot
166/// metacharacter will *not* match a newline; hence, patterns expected to match
167/// across lines will fail to do so. This flag is equivalent to Perl's `/s`
168/// option, and can be changed within a pattern by a `(?s)` option setting. A
169/// negative class such as `[^a]` always matches newline characters, independent
170/// of the setting of this option.
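///
/// The contrast between a dot and a negative class can be tried out with the
/// standard library. Note the assumption: `std::regex` (ECMAScript grammar) is
/// used here purely as a stand-in whose default dot also excludes newlines; it
/// is not PCRE2 and has no counterpart to `k_FLAG_DOTMATCHESALL`. The helper
/// name `matchesWhole` is hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Return 'true' if the whole of the specified 'subject' matches the
// specified 'pattern'.  'std::regex' is a stand-in here, behaving like a
// pattern prepared *without* a dot-matches-all option.
inline bool matchesWhole(const std::string& pattern,
                         const std::string& subject)
{
    return std::regex_match(subject, std::regex(pattern));
}
```

/// With this stand-in, `matchesWhole(".", "\n")` is `false`, whereas
/// `matchesWhole("[^a]", "\n")` is `true`.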
171///
172/// ### Allow Duplicate Named Groups (sub-patterns) {#bdlpcre_regex-allow-duplicate-named-groups}
173///
174///
175/// If `RegEx::k_FLAG_DUPNAMES` is included in the flags supplied to `prepare`,
176/// then sub-pattern names can be used more than once. Alternatively this
177/// feature can be turned on within a pattern by a `(?J)` option setting
178/// (see https://www.pcre.org/current/doc/html/pcre2syntax.html#SEC16). The
179/// `subpatternIndex(name)` call will fail if `name` is used more than once - in
180/// that case, the `namedSubpatterns()` call should be used.
181/// `namedSubpatterns()` returns a set of (name, index) pairs used in the
182/// pattern.
183///
184/// ## Creating a New String with Replacement {#bdlpcre_regex-creating-a-new-string-with-replacement}
185///
186///
187/// The `replace` and `replaceRaw` methods create a new string by applying the
188/// prepared pattern to a subject string and replacing the matching parts with
189/// the supplied replacement string.
190///
191/// ### Group Insertion Forms {#bdlpcre_regex-group-insertion-forms}
192///
193///
194/// By default, a dollar character (`$`) is an escape character that can specify
195/// the insertion of characters from capture groups and names from `(*MARK)` or
196/// other control verbs in the pattern (see
197/// https://perldoc.perl.org/perlre#Special-Backtracking-Control-Verbs for
198/// details). The following forms are always recognized:
199/// @code
200/// $$ insert a dollar character
201/// $<n> or ${<n>} insert the contents of group <n>
202/// $*MARK or ${*MARK} insert a control verb name
203/// @endcode
204/// Either a group number or a group name can be given for `<n>`. Curly braces
205/// are required only if the following character would be interpreted as part of
206/// the number or name. The number may be zero to include the entire matched
207/// string. For example, if the pattern `a(b)c` is matched with `=abc=` and the
208/// replacement string `+$1$0$1+`, the result is `=+babcb+=`.
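///
/// To make the insertion forms concrete, here is a simplified model of how a
/// replacement string might be expanded once the groups of a match are known.
/// It is a sketch under stated assumptions, not PCRE2's implementation: only
/// numbered groups are handled, `$*MARK` and group names are left out, an
/// unrecognized form is copied through rather than reported as an error, and
/// the function name `expandGroups` is hypothetical.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Expand the '$$', '$<n>', and '${<n>}' forms in the specified 'replacement'
// using the specified 'groups', where 'groups[0]' is the entire match.
inline std::string expandGroups(const std::string&              replacement,
                                const std::vector<std::string>& groups)
{
    std::string out;
    for (std::size_t i = 0; i < replacement.size(); ++i) {
        const char c = replacement[i];
        if (c != '$' || i + 1 == replacement.size()) {
            out += c;                        // ordinary character
            continue;
        }
        std::size_t j = i + 1;
        if (replacement[j] == '$') {         // '$$' inserts one dollar
            out += '$';
            i = j;
            continue;
        }
        const bool braced = replacement[j] == '{';
        if (braced) {
            ++j;
        }
        std::size_t n = 0, digits = 0;
        while (j < replacement.size()
               && std::isdigit(static_cast<unsigned char>(replacement[j]))) {
            n = n * 10 + (replacement[j] - '0');
            ++j;
            ++digits;
        }
        const bool closed =
               !braced || (j < replacement.size() && replacement[j] == '}');
        if (digits == 0 || !closed || n >= groups.size()) {
            out += c;                        // unsupported form: copy the '$'
            continue;
        }
        out += groups[n];
        i = braced ? j : j - 1;              // skip the consumed reference
    }
    return out;
}
```

/// For the example above, the groups are `{"abc", "b"}`, and expanding
/// `+$1$0$1+` yields `+babcb+`, which replaces `abc` inside `=abc=`.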
209///
210/// ### Replacement Flags {#bdlpcre_regex-replacement-flags}
211///
212///
213/// A set of flags may be optionally supplied to the `replace` and `replaceRaw`
214/// methods to affect specific substitution behavior. The flags recognized by
215/// `replace` and `replaceRaw` are defined in an enumeration declared within
216/// `bdlpcre::RegEx`. The flags are combined with bitwise OR and passed in the
217/// `options` argument to `replace` and `replaceRaw` (e.g.,
218/// `k_REPLACE_GLOBAL | k_REPLACE_LITERAL`). The flags reflect the
219/// `PCRE2_SUBSTITUTE_*` flags and are propagated to the underlying PCRE2 library
220/// substitute function. See
221/// {https://www.pcre.org/current/doc/html/pcre2api.html#SEC36} for details. The
222/// following describes these flags and their effects.
223///
224/// #### Global Replacement {#bdlpcre_regex-global-replacement}
225///
226///
227/// The default action of `replace` and `replaceRaw` is to perform just one
228/// replacement if the pattern matches. The `RegEx::k_REPLACE_GLOBAL` flag
229/// requests multiple replacements in the subject string.
230///
231/// #### The Replacement String is Literal {#bdlpcre_regex-the-replacement-string-is-literal}
232///
233///
234/// If `RegEx::k_REPLACE_LITERAL` is set, the replacement string is not
235/// interpreted in any way.
236///
237/// #### Extended Replacement Processing {#bdlpcre_regex-extended-replacement-processing}
238///
239///
240/// If `RegEx::k_REPLACE_EXTENDED` is set, extra processing is applied to the
241/// replacement string. Without this option, only the dollar character (`$`) is
242/// special, and only the group insertion forms listed above (see
243/// {Group Insertion Forms}) are valid. When this flag is set, two things
244/// change:
245///
246/// * Firstly, backslash in a replacement string is interpreted as an escape
247/// character. The usual forms such as '\n' or '\x{ddd}' can be used to
248/// specify particular character codes, and backslash followed by any
249/// non-alphanumeric character quotes that character. Extended quoting can
250/// be coded using '\Q...\E', exactly as in the pattern string.
251/// * The second effect is to add more flexibility to capture group
252/// substitution. The syntax is similar to that used by Bash:
253/// @code
254/// ${<n>:-<string>}
255/// ${<n>:+<string1>:<string2>}
256/// @endcode
257/// As before, `<n>` may be a group number or a name. The first form
258/// specifies a default value. If group `<n>` is set, its value is inserted;
259/// if not, `<string>` is expanded and the result inserted. The second form
260/// specifies strings that are expanded and inserted when group `<n>` is set
261/// or unset, respectively. The first form is just a convenient shorthand
262/// for `${<n>:+${<n>}:<string>}`.
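///
/// The relationship between the two extended forms can be sketched as follows.
/// This is a model only: a group is represented by `std::optional` (empty when
/// unset), the strings are taken as already expanded, and the function names
/// `chooseBySetness` and `valueOrDefault` are hypothetical.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Model of '${<n>:+<string1>:<string2>}': return the specified 'ifSet' when
// the specified 'group' is set, and the specified 'ifUnset' otherwise.
inline std::string chooseBySetness(const std::optional<std::string>& group,
                                   const std::string&                ifSet,
                                   const std::string&                ifUnset)
{
    return group ? ifSet : ifUnset;
}

// Model of '${<n>:-<string>}', written in terms of the second form to mirror
// the shorthand '${<n>:+${<n>}:<string>}'.
inline std::string valueOrDefault(const std::optional<std::string>& group,
                                  const std::string&                fallback)
{
    return chooseBySetness(group, group.value_or(std::string()), fallback);
}
```

/// Note that a group that is set but empty still counts as set in this model,
/// so `valueOrDefault` returns the empty string for it, not the default.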
263///
264/// #### Treat Unknown Group As Unset {#bdlpcre_regex-treat-unknown-group-as-unset}
265///
266///
267/// The `RegEx::k_REPLACE_UNKNOWN_UNSET` flag causes references to capture groups
268/// that do not appear in the pattern to be treated as unset groups.
269///
270/// #### Insert An Empty String For Unset Group {#bdlpcre_regex-insert-an-empty-string-for-unset-group}
271///
272///
273/// The `RegEx::k_REPLACE_UNSET_EMPTY` flag causes unset capture groups (including
274/// unknown groups when `RegEx::k_REPLACE_UNKNOWN_UNSET` is set) to be treated
275/// as empty strings when inserted as described in {Group Insertion Forms}. If
276/// this option is not set, an attempt to insert an unset group causes `replace`
277/// and `replaceRaw` to return an error. This option does not influence the
278/// extended substitution syntax described in {Extended Replacement Processing}.
279///
280/// ## JIT Compiling Optimization {#bdlpcre_regex-jit-compiling-optimization}
281///
282///
283/// Just-in-time compiling is a heavyweight optimization that can greatly speed
284/// up pattern matching on supported platforms. However, it comes at the cost
285/// of extra processing before the match is performed, so it is of most benefit
286/// when the same pattern is going to be matched many times. This does not
287/// necessarily mean many calls of a matching function; if the pattern is not
288/// anchored, matching attempts may take place many times at various positions
289/// in the subject, even for a single call. Therefore, if the subject string is
290/// very long, it may still pay to use JIT even for one-off matches.
291///
292/// If `RegEx::k_FLAG_JIT` is included in the flags supplied to `prepare`, then
293/// all following matches performed by `matchRaw` will be JIT optimized.
294/// Matches performed by `match` will also be JIT optimized provided that
295/// `RegEx::k_FLAG_UTF8` was not supplied to `prepare` (since UTF-8 string
296/// validity checking is not done during JIT compilation). To disable JIT
297/// optimization for all matches, prepare the regular expression again omitting
298/// the `k_FLAG_JIT` flag.
299///
300/// JIT is supported on the following platforms:
301/// @code
302/// ARM 32-bit (v5, v7, and Thumb2)
303/// ARM 64-bit
304/// Intel x86 32-bit and 64-bit
305/// MIPS 32-bit and 64-bit
306/// Power PC 32-bit and 64-bit
307/// SPARC 32-bit
308/// @endcode
309///
310/// The tables below demonstrate the benefit of the `match` method with JIT
311/// optimizations, as well as the increased cost for `prepare` when enabling JIT
312/// optimizations:
313/// @code
314/// Legend
315/// ------
316/// 'SIMPLE_PATTERN':
317/// Pattern - X(abc)*Z
318/// Subject - XXXabcabcZZZ
319///
320/// 'EMAIL_PATTERN':
321/// Pattern - [A-Za-z0-9._-]+@[[A-Za-z0-9.-]+
322/// Subject - john.dow@bloomberg.net
323///
324/// 'IP_ADDRESS_PATTERN':
325/// Pattern - (?:[0-9]{1,3}\.){3}[0-9]{1,3}
326/// Subject - 255.255.255.255
327///
328/// Each pattern/subject returns 1 match.
329/// @endcode
330/// In this first table, for each pattern, `prepare` was called once, and `match`
331/// was called 100000 times (measurements are in seconds):
332/// @code
333/// Table 1: Performance Improvement for 'match' using k_FLAG_JIT
334/// +--------------------+---------------------+---------------------+
335/// | Pattern            | 'match' without-JIT | 'match' using-JIT   |
336/// +====================+=====================+=====================+
337/// | SIMPLE_PATTERN     | 0.0559 (~5.1x)      | 0.0108              |
338/// +--------------------+---------------------+---------------------+
339/// | EMAIL_PATTERN      | 0.0222 (~2.6x)      | 0.0086              |
340/// +--------------------+---------------------+---------------------+
341/// | IP_ADDRESS_PATTERN | 0.0331 (~5.3x)      | 0.0062              |
342/// +--------------------+---------------------+---------------------+
343/// @endcode
344/// In this second table, for each pattern, we measured 10000 iterations, where
345/// `prepare` was called once, and `match` was called once (measurements are in
346/// seconds):
347/// @code
348/// Table 2: Performance Cost for 'prepare' using k_FLAG_JIT
349/// +--------------------+-----------------------+-----------------------+
350/// | Pattern            | 'prepare' without-JIT | 'prepare' using-JIT   |
351/// +====================+=======================+=======================+
352/// | SIMPLE_PATTERN     | 0.2514                | 2.1426 (~8.5x)        |
353/// +--------------------+-----------------------+-----------------------+
354/// | EMAIL_PATTERN      | 0.3386                | 2.5758 (~7.6x)        |
355/// +--------------------+-----------------------+-----------------------+
356/// | IP_ADDRESS_PATTERN | 0.3016                | 2.4433 (~8.1x)        |
357/// +--------------------+-----------------------+-----------------------+
358/// @endcode
359/// Note that the tests were run on Linux / Intel Xeon CPU (3.47GHz, 64-bit),
360/// compiled with gcc-4.8.2 in optimized mode.
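///
/// Taken together, the two tables give a rough idea of how many `match` calls
/// are needed before JIT pays for itself. The arithmetic below is only an
/// estimate under stated assumptions: the Table 2 timings also include one
/// `match` call per iteration, the figures apply to the platform noted above,
/// and the function name `breakEvenMatches` is hypothetical.

```cpp
#include <cassert>

// Return the approximate number of 'match' calls after which the extra
// 'prepare' cost of JIT is repaid.  All timings are totals in seconds over
// the specified iteration counts, as in the tables above.
inline double breakEvenMatches(double prepareJit,
                               double preparePlain,
                               double prepareIterations,
                               double matchPlain,
                               double matchJit,
                               double matchIterations)
{
    // Extra time spent by one 'prepare' call when JIT is enabled.
    const double extraPrepare =
                            (prepareJit - preparePlain) / prepareIterations;

    // Time saved by one 'match' call when JIT is enabled.
    const double savedPerMatch = (matchPlain - matchJit) / matchIterations;

    return extraPrepare / savedPerMatch;
}
```

/// For `SIMPLE_PATTERN` this gives
/// `breakEvenMatches(2.1426, 0.2514, 10000, 0.0559, 0.0108, 100000)`, i.e.,
/// about 419 matches; for `EMAIL_PATTERN` the same calculation gives about
/// 1645.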
361///
362/// ## Thread Safety {#bdlpcre_regex-thread-safety}
363///
364///
365/// `bdlpcre::RegEx` is *const* *thread-safe*, meaning that accessors may be
366/// invoked concurrently from different threads, but it is not safe to access or
367/// modify a `bdlpcre::RegEx` in one thread while another thread modifies the
368/// same object. Specifically, the `match` method can be called from multiple
369/// threads after the pattern has been prepared.
370///
371/// Note that `bdlpcre::RegEx` incurs some overhead in order to provide
372/// thread-safe pattern matching functionality. To perform the pattern match,
373/// the underlying PCRE2 library requires a set of buffers that cannot be shared
374/// between threads.
375///
376/// The table below demonstrates the difference between invoking the `match`
377/// method from the main thread (the thread that invokes `prepare`) and from
377/// other threads:
378/// @code
379/// Table 3: Performance cost for 'match' in multi-threaded application
380/// +--------------------+-----------------------+----------------------------+
381/// | Pattern            | 'match' (main thread) | 'match' (other thread(s))  |
382/// +====================+=======================+============================+
383/// | SIMPLE_PATTERN     | 0.0549 (~1.4x)        | 0.0759                     |
384/// +--------------------+-----------------------+----------------------------+
385/// | EMAIL_PATTERN      | 0.0259 (~1.8x)        | 0.0464                     |
386/// +--------------------+-----------------------+----------------------------+
387/// | IP_ADDRESS_PATTERN | 0.0377 (~1.5x)        | 0.0560                     |
388/// +--------------------+-----------------------+----------------------------+
389/// @endcode
390/// Note that the JIT stack is functionally part of the match context. Using a
391/// large JIT stack can incur an additional performance penalty in
392/// multi-threaded applications.
393///
394/// ## Note on Memory Allocation Exceptions {#bdlpcre_regex-note-on-memory-allocation-exceptions}
395///
396///
397/// The PCRE2 library supports memory allocation/deallocation functions supplied by
398/// the client. @ref bdlpcre_regex provides wrappers around `bslma` allocators
399/// that are called from the context of the PCRE2 library (C linkage). Any
400/// exceptions thrown during memory allocation are caught by the wrapper
401/// functions and are not propagated to the PCRE2 library.
402///
403/// ## Usage {#bdlpcre_regex-usage}
404///
405///
406/// The following snippets of code illustrate using this component to extract
407/// the text of the "Subject:" field from an Internet e-mail message (RFC822).
408/// The following `parseSubject` function accepts an RFC822-compliant message of
409/// a specified length and returns the text of the message's subject in the
410/// `result` "out" parameter:
411/// @code
412/// int parseSubject(bsl::string *result,
413///                  const char  *message,
414///                  bsl::size_t  messageLength)
415///     // Parse the specified 'message' of the specified 'messageLength' for
416///     // the "Subject:" field of 'message'. Return 0 on success and load the
417///     // specified 'result' with the text of the subject of 'message'; return
418///     // a non-zero value otherwise with no effect on 'result'.
419/// {
420/// @endcode
421/// The following is the regular expression that will be used to find the
422/// subject text of `message`. The "?P<subjectText>" syntax, borrowed from
423/// Python, allows us later to refer to a particular matched sub-pattern (i.e.,
424/// the text between the `:` and the '\r' in the "Subject:" field of the header)
425/// by the name "subjectText":
426/// @code
427///     const char PATTERN[] = "^subject:(?P<subjectText>[^\r]*)";
428/// @endcode
429/// First we compile the `PATTERN`, using the `prepare` method, in order to
430/// match subject strings against it. In the event that `prepare` fails, the
431/// first two arguments will be loaded with diagnostic information (an
432/// informational string and an index into the pattern at which the error
433/// occurred, respectively). Two flags, `RegEx::k_FLAG_CASELESS` and
434/// `RegEx::k_FLAG_MULTILINE`, are used in preparing the pattern since Internet
435/// message headers contain case-insensitive content as well as '\n' characters.
436/// The `prepare` method returns 0 on success, and a non-zero value otherwise:
437/// @code
438///     RegEx       regEx;
439///     bsl::string errorMessage;
440///     size_t      errorOffset;
441///
442///     int returnValue = regEx.prepare(&errorMessage,
443///                                     &errorOffset,
444///                                     PATTERN,
445///                                     RegEx::k_FLAG_CASELESS |
446///                                     RegEx::k_FLAG_MULTILINE);
447///     assert(0 == returnValue);
448/// @endcode
449/// Next we call `match` supplying `message` and its length. The `matchVector`
450/// will be populated with (offset, length) pairs describing substrings in
451/// `message` that match the prepared `PATTERN`. All variants of the overloaded
452/// `match` method return the `k_STATUS_SUCCESS` status if a match is found,
453/// `k_STATUS_NO_MATCH` if a match is not found, and some other value if any
454/// error occurs. This value may help us understand the reason for the failure:
455/// @code
456///     bsl::vector<bsl::pair<size_t, size_t> > matchVector;
457///     returnValue = regEx.match(&matchVector, message, messageLength);
458///
459///     if (RegEx::k_STATUS_SUCCESS != returnValue) {
460///         if (RegEx::k_STATUS_NO_MATCH == returnValue) {
461///             // No match.
462///             return returnValue;                                   // RETURN
463///         }
464///         else {
465///             // Some failure occurred during the function call.
466///             bsl::cout << "'RegEx::match' failed with the following"
467///                       << " status: "
468///                       << returnValue
469///                       << bsl::endl;
470///             return returnValue;                                   // RETURN
471///         }
472///     }
473/// @endcode
474/// Then we pass "subjectText" to the `subpatternIndex` method to obtain the
475/// index into `matchVector` that describes how to locate the subject text
476/// within `message`. The text is then extracted from `message` and assigned to
477/// the `result` "out" parameter:
478/// @code
479///     const bsl::pair<size_t, size_t> capturedSubject =
480///                          matchVector[regEx.subpatternIndex("subjectText")];
481///
482///     *result = bsl::string(&message[capturedSubject.first],
483///                           capturedSubject.second);
484///
485///     return 0;
486/// }
487/// @endcode
488/// The following array contains the sample Internet e-mail message from which
489/// we will extract the subject:
490/// @code
491/// const char RFC822_MESSAGE[] =
492///     "Received: ; Fri, 23 Apr 2004 14:30:00 -0400\r\n"
493///     "Message-ID: <12345@mailgate.bloomberg.net>\r\n"
494///     "Date: Fri, 23 Apr 2004 14:30:00 -0400\r\n"
495///     "From: <someone@bloomberg.net>\r\n"
496///     "To: <someone_else@bloomberg.net>\r\n"
497///     "Subject: This is the subject text\r\n"
498///     "MIME-Version: 1.0\r\n"
499///     "Content-Type: text/plain\r\n"
500///     "\r\n"
501///     "This is the message body.\r\n"
502///     ".\r\n";
503/// @endcode
504/// Finally, we call `parseSubject` to extract the subject from
505/// `RFC822_MESSAGE`. The assertions verify that the subject of the message is
506/// correctly extracted and assigned to the local `subject` variable:
507/// @code
508/// int main()
509/// {
510///     bsl::string subject;
511///     const int returnValue = parseSubject(&subject,
512///                                          RFC822_MESSAGE,
513///                                          sizeof(RFC822_MESSAGE) - 1);
514///     assert(0 == returnValue);
515///     assert(" This is the subject text" == subject);
516/// }
517/// @endcode
518///
519/// ### Appendix: Perl Compatibility {#bdlpcre_regex-appendix-perl-compatibility}
520///
521///
522/// This section describes the differences in the ways that PCRE2 and Perl
523/// handle regular expressions. The differences described here are with respect
524/// to Perl versions 5.10 and above.
525///
526/// 1) PCRE2 has only a subset of Perl's Unicode support.
527///
528/// 2) PCRE2 allows repeat quantifiers only on parenthesized assertions, but
529/// they do not mean what you might think. For example, `(?!a){3}` does not
530/// assert that the next three characters are not `"a"`. It just asserts that
531/// the next character is not `"a"` three times (in principle: PCRE2 optimizes
532/// this to run the assertion just once). Perl allows repeat quantifiers on
533/// other assertions such as '\b', but these do not seem to have any use.
534///
535/// 3) Capturing subpatterns that occur inside negative lookahead assertions are
536/// counted, but their entries in the offsets vector are never set. Perl
537/// sometimes (but not always) sets its numerical variables from inside negative
538/// assertions.
539///
540/// 4) The following Perl escape sequences are not supported: '\l', '\u', '\L',
541/// '\U', and '\N' when followed by a character name or Unicode value. ('\N' on
542/// its own, matching a non-newline character, is supported.) In fact these are
543/// implemented by Perl's general string-handling and are not part of its
544/// pattern matching engine. If any of these are encountered by PCRE2, an error
545/// is generated by default.
546///
547/// 5) The Perl escape sequences '\p', '\P', and '\X' are supported only if
548/// PCRE2 is built with Unicode support. The properties that can be tested with
549/// '\p' and '\P' are limited to the general category properties such as `Lu`
550/// and `Nd`, script names such as Greek or Han, and the derived properties
551/// `Any` and `L&`. PCRE2 does support the `Cs` (surrogate) property, which
552/// Perl does not; the Perl documentation says "Because Perl hides the need for
553/// the user to understand the internal representation of Unicode characters,
554/// there is no need to implement the somewhat messy concept of surrogates."
555///
556/// 6) PCRE2 does support the '\Q...\E' escape for quoting substrings.
557/// Characters in between are treated as literals. This is slightly different
558/// from Perl in that `$` and `@` are also handled as literals inside the
559/// quotes. In Perl, they cause variable interpolation (but of course PCRE2
560/// does not have variables). Note the following examples:
561/// @code
562/// Pattern PCRE2 matches Perl matches
563/// ---------------- ------------- ------------------------------------
564/// \Qabc$xyz\E abc$xyz abc followed by the contents of $xyz
565/// \Qabc\$xyz\E abc\$xyz abc\$xyz
566/// \Qabc\E\$\Qxyz\E abc$xyz abc$xyz
567/// @endcode
568/// The '\Q...\E' sequence is recognized both inside and outside character
569/// classes.
570///
571/// 7) PCRE2 does not support the `(?{code})` and `(??{code})` constructions.
572/// However, there is support for recursive patterns. This is not available in
573/// Perl 5.8, but it is in Perl 5.10.
574///
575/// 8) Subroutine calls (whether recursive or not) are treated as atomic groups.
576/// Atomic recursion is like Python, but unlike Perl. Captured values that are
577/// set outside a subroutine call can be referenced from inside in PCRE2, but
578/// not in Perl.
579///
580/// 9) If any of the backtracking control verbs are used in a subpattern that is
581/// called as a subroutine (whether or not recursively), their effect is
582/// confined to that subpattern; it does not extend to the surrounding pattern.
583/// This is not always the case in Perl. In particular, if `(*THEN)` is present
584/// in a group that is called as a subroutine, its action is limited to that
585/// group, even if the group does not contain any `|` characters. Note that
586/// such subpatterns are processed as anchored at the point where they are
587/// tested.
588///
589/// 10) If a pattern contains more than one backtracking control verb, the first
590/// one that is backtracked onto acts. For example, in the pattern
591/// `A(*COMMIT)B(*PRUNE)C` a failure in `B` triggers `(*COMMIT)`, but a failure
592/// in `C` triggers `(*PRUNE)`. Perl's behaviour is more complex; in many cases
593/// it is the same as PCRE2, but there are examples where it differs.
594///
595/// 11) Most backtracking verbs in assertions have their normal actions. They
596/// are not confined to the assertion.
597///
598/// 12) There are some differences that are concerned with the settings of
599/// captured strings when part of a pattern is repeated. For example, matching
600/// `"aba"` against the pattern `/^(a(b)?)+$/` in Perl leaves `$2` unset, but in
601/// PCRE2 it is set to `"b"`.
602///
603/// 13) PCRE2's handling of duplicate subpattern numbers and duplicate
604/// subpattern names is not as general as Perl's. This is a consequence of the
605/// fact that PCRE2 works internally just with numbers, using an external table
606/// to translate between numbers and names. In particular, a pattern such as
607/// `(?|(?<a>A)|(?<b>B))`, where the two capturing parentheses have the same
608/// number but different names, is not supported, and causes an error at compile
609/// time. If it were allowed, it would not be possible to distinguish which
610/// parentheses matched, because both names map to capturing subpattern number
611/// 1. To avoid this confusing situation, an error is given at compile time.
612///
613/// 14) Perl recognizes comments in some places that PCRE2 does not, for
614/// example, between the `(` and `?` at the start of a subpattern. If the `/x`
615/// modifier is set, Perl allows white space between `(` and `?` (though current
616/// Perls warn that this is deprecated) but PCRE2 never does, even if the
617/// `PCRE2_EXTENDED` option is set.
618///
619/// 15) Perl, when in warning mode, gives warnings for character classes such as
620/// `[A-\d]` or `[a-[:digit:]]`. It then treats the hyphens as literals. PCRE2
621/// has no warning features, so it gives an error in these cases because they
622/// are almost certainly user mistakes.
623///
624/// 16) In PCRE2, the upper/lower case character properties `Lu` and `Ll` are
625/// not affected when case-independent matching is specified. For example,
626/// `\p{Lu}` always matches an upper case letter.
627///
628/// 17) PCRE2 provides some extensions to the Perl regular expression
629/// facilities. This list is with respect to Perl 5.10:
630///
631/// (a) Although lookbehind assertions in PCRE2 must match fixed length strings,
632/// each alternative branch of a lookbehind assertion can match a different
633/// length of string. Perl requires them all to have the same length.
634///
635/// (b) If `PCRE2_DOLLAR_ENDONLY` is set and `PCRE2_MULTILINE` is not set, the
636/// `$` meta-character matches only at the very end of the string.
637///
638/// (c) A backslash followed by a letter with no special meaning is faulted.
639/// (Perl can be made to issue a warning.)
640///
641/// (d) If `PCRE2_UNGREEDY` is set, the greediness of the repetition quantifiers
642/// is inverted, that is, by default they are not greedy, but if followed by a
643/// question mark they are.
644///
645/// (e) `PCRE2_ANCHORED` can be used at matching time to force a pattern to be
646/// tried only at the first matching position in the subject string.
647///
648/// (f) The `PCRE2_NOTBOL`, `PCRE2_NOTEOL`, `PCRE2_NOTEMPTY`,
649/// `PCRE2_NOTEMPTY_ATSTART`, and `PCRE2_NO_AUTO_CAPTURE` options have no Perl
650/// equivalents.
651///
652/// (g) The `\R` escape sequence can be restricted to match only `CR`, `LF`, or
653/// `CRLF` by the `PCRE2_BSR_ANYCRLF` option.
654///
655/// (h) The callout facility is PCRE2-specific.
656///
657/// (i) The partial matching facility is PCRE2-specific.
658///
659/// (j) The alternative matching function (`pcre2_dfa_match()`) matches in a
660/// different way and is not Perl-compatible.
661///
662/// (k) PCRE2 recognizes some special sequences such as `(*CR)` at the start of
663/// a pattern that set overall options that cannot be changed within the
664/// pattern.
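///
/// As a rough illustration of item (d), the sketch below contrasts a greedy
/// quantifier with its non-greedy form. It uses `std::regex` (ECMAScript
/// grammar) instead of PCRE2 purely so that it is self-contained, and the
/// helper name `firstMatch` is hypothetical. With `PCRE2_UNGREEDY` set, PCRE2
/// would swap the behavior of the two quantifier spellings shown here.

```cpp
#include <regex>
#include <string>

// Return the first substring of 'subject' matched by 'pattern', or an empty
// string if there is no match.  Uses the standard library's ECMAScript
// engine, not PCRE2.
std::string firstMatch(const std::string& subject, const std::string& pattern)
{
    const std::regex re(pattern);
    std::smatch      m;
    return std::regex_search(subject, m, re) ? m.str(0) : std::string();
}
```

/// For the subject `"<a><b>"`, the greedy pattern `<.+>` consumes the whole
/// string, whereas the non-greedy pattern `<.+?>` stops at `"<a>"`.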
665///
666/// ### Additional Copyright Notice {#bdlpcre_regex-additional-copyright-notice}
667///
668///
669/// @code
670/// Copyright (c) 1997-2015 University of Cambridge
671/// All rights reserved.
672///
673/// Redistribution and use in source and binary forms, with or without
674/// modification, are permitted provided that the following conditions are met:
675///
676/// * Redistributions of source code must retain the above copyright notice,
677/// this list of conditions and the following disclaimer.
678///
679/// * Redistributions in binary form must reproduce the above copyright
680/// notice, this list of conditions and the following disclaimer in the
681/// documentation and/or other materials provided with the distribution.
682///
683/// * Neither the name of the University of Cambridge nor the names of any
684/// contributors may be used to endorse or promote products derived from
685/// this software without specific prior written permission.
686///
687/// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
688/// AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
689/// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
690/// ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
691/// LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
692/// CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
693/// SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
694/// INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
695/// CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
696/// ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
697/// POSSIBILITY OF SUCH DAMAGE.
700/// @endcode
701/// @}
702/** @} */
703/** @} */
704
705/** @addtogroup bdl
706 * @{
707 */
708/** @addtogroup bdlpcre
709 * @{
710 */
711/** @addtogroup bdlpcre_regex
712 * @{
713 */
714
715#include <bdlscm_version.h>
716
717#include <bslma_allocator.h>
718#include <bslma_managedptr.h>
720
721#include <bslmf_enableif.h>
722#include <bslmf_issame.h>
724
726#include <bsls_libraryfeatures.h>
727
728#include <bsl_cstddef.h>
729#include <bsl_string.h>
730#include <bsl_string_view.h>
731#include <bsl_utility.h> // 'bsl::pair'
732#include <bsl_vector.h>
733
734#include <string>
735#include <vector>
736
737#ifndef _PCRE2_H
738#define PCRE2_CODE_UNIT_WIDTH 8
739#define PCRE2_STATIC
740#include <pcre2/pcre2.h>
741#endif
742
743#ifndef BDE_DONT_ALLOW_TRANSITIVE_INCLUDES
744#include <bsls_types.h>
745#endif
746
747
748namespace bdlpcre {
749
750class RegEx_MatchContext;
751
752 // ===========
753 // class RegEx
754 // ===========
755
756/// This class provides a mechanism for compiling and matching regular
757/// expressions. A regular expression approximately compatible with Perl
758/// 5.10 is compiled with the `prepare` method. Subsequently, strings are
759/// matched against the compiled (prepared) pattern using the overloaded
760/// `match` and `matchRaw` methods. Note that the underlying implementation
761/// uses the open-source Perl Compatible Regular Expressions (PCRE2) library
762/// that was developed at the University of Cambridge
763/// (`http://www.pcre.org/`).
764///
765/// See @ref bdlpcre_regex
766class RegEx {
767
768 // CLASS DATA
769 static
770 bsls::AtomicOperations::AtomicTypes::Int s_depthLimit; // process-wide
771 // default maximum
772 // evaluation
773 // recursion depth
774
775 // PRIVATE DATA
776 int d_flags; // prepare/match flags
777
778 bsl::string d_pattern; // regular expression pattern
779
780 pcre2_general_context *d_pcre2Context_p; // PCRE2 general context
781
782 pcre2_compile_context *d_compileContext_p; // PCRE2 compile context
783
784 pcre2_code *d_patternCode_p; // PCRE2 compiled pattern
785
786 int d_depthLimit; // evaluation recursion depth
787
788 size_t d_jitStackSize; // PCRE JIT stack size
789
791 d_matchContext; // match context helper
792
793 bslma::Allocator *d_allocator_p; // allocator to supply memory
794
795 private:
796 // NOT IMPLEMENTED
797 RegEx(const RegEx&);
798 RegEx& operator=(const RegEx&);
799
800 // PRIVATE MANIPULATORS
801
802 /// Prepare this regular-expression object with the specified `pattern`,
803 /// `flags`, and `jitStackSize` that indicates the size of the allocated
804 /// JIT stack to be used for `pattern`. On success, put this object
805 /// into the "prepared" state and return 0, with no effect on the
806 /// specified `errorBuffer` and `errorOffset`. Otherwise, (1) put this
807 /// object into the "unprepared" state, (2) load `errorBuffer` with a
808 /// message describing the error detected truncated to the specified
809 /// `errorBufferLength` (including a null terminator), (3) load
810 /// `errorOffset` with the offset in `pattern` at which the error was
811 /// detected, and (4) return a non-zero value. The behavior is
812 /// undefined unless `flags` is the bit-wise inclusive-or of 0 or more
813 /// of the following values:
814 /// @code
815 /// k_FLAG_CASELESS
816 /// k_FLAG_DOTMATCHESALL
817 /// k_FLAG_MULTILINE
818 /// k_FLAG_UTF8
819 /// k_FLAG_JIT
820 /// k_FLAG_DUPNAMES
821 /// @endcode
822 /// Note that the flag `k_FLAG_JIT` is ignored if `isJitAvailable()` is
823 /// `false`.
824 int prepareImp(char *errorBuffer,
825 size_t errorBufferLength,
826 size_t *errorOffset,
827 const char *pattern,
828 int flags,
829 size_t jitStackSize);
830
831 // PRIVATE ACCESSORS
832
833 /// Match the specified `subject`, having the specified `subjectLength`,
834 /// against the pattern held by this regular-expression object
835 /// (`pattern()`). `subject` need not be null-terminated and may
836 /// contain embedded null characters. The specified
837 /// `skipUTF8Validation` flag indicates whether UTF-8 string validity
838 /// checking is skipped. Begin matching at the specified
839 /// `subjectOffset` in `subject`. Return `k_STATUS_SUCCESS` on success,
840 /// `k_STATUS_NO_MATCH` if a match is not found, and another value if an
841 /// error occurs. If the returned status is not `k_STATUS_SUCCESS` or
842 /// `k_STATUS_NO_MATCH` it may match one of specific `k_STATUS_*` error
843 /// return constants defined below (but is not guaranteed to). The
844 /// behavior is undefined unless `true == isPrepared()`,
845 /// `subject || 0 == subjectLength`, `subjectOffset <= subjectLength`,
846 /// and `subject` is valid UTF-8 if `pattern()` was prepared with
847 /// `k_FLAG_UTF8` but `false == skipUTF8Validation`.
848 template <class RESULT_EXTRACTOR>
849 int matchImp(const RESULT_EXTRACTOR& extractor,
850 const char *subject,
851 size_t subjectLength,
852 size_t subjectOffset,
853 bool skipUTF8Validation) const;
854
855 /// `namedSubpatterns()` implementation.
856 template <class Vector>
857 void namedSubpatternsImp(Vector *result) const;
858
859 /// Replace parts of the specified `subject` that are matched with the
860 /// specified `replacement`. The specified bit mask of `options` flags
861 /// is used to configure the behavior of the replacement. `options`
862 /// should contain a bit-wise OR of the `k_REPLACE_*` constants defined
863 /// by this class, which indicate additional configuration parameters
864 /// for the replacement. If `options` has `k_REPLACE_GLOBAL` flag then
865 /// this function iterates over `subject`, replacing every matching
866 /// substring. If `k_REPLACE_GLOBAL` flag is not set, only the first
867 /// matching substring is replaced. The specified `skipUTF8Validation`
868 /// flag indicates whether UTF-8 `replacement` validity checking is
869 /// skipped. Return the number of substitutions that were carried out,
870 /// and load the specified `result` with the result of the replacement.
871 /// Otherwise, if an error occurs, return a negative value. If that
872 /// error is a syntax error in `replacement`, load the specified
873 /// `errorOffset` (if non-null) with the offset in `replacement` where
874 /// the error was detected; for other errors, such as invalid `subject`
875 /// or `replacement` UTF-8 string, load `errorOffset` with a negative
876 /// value. The behavior is undefined unless `true == isPrepared()`.
877 /// Note that if the size of `result` is too small to fit the resultant
878 /// string then this method computes the size of `result` and adjusts it
879 /// to the size that is needed. To avoid automatic calculation and
880 /// adjustment which may introduce a performance penalty, it is
881 /// recommended that the size of `result` has enough room to fit the
882 /// zero-terminating character.
883 template <class STRING>
884 int replaceImp(STRING *result,
885 int *errorOffset,
886 const bsl::string_view& subject,
887 const bsl::string_view& replacement,
888 size_t options,
889 bool skipUTF8Validation) const;
890
891 public:
892 // TRAITS
894
895 // CONSTANTS
896 enum {
897 // This enumeration defines the flags that may be supplied to 'prepare'
898 // to affect specific pattern matching behavior.
899
900 k_FLAG_CASELESS = 1 << 0, // case-insensitive matching
901
902 k_FLAG_DOTMATCHESALL = 1 << 1, // dot metacharacter matches all chars
903 // (including newlines)
904
905 k_FLAG_MULTILINE = 1 << 2, // multi-line matching
906
907 k_FLAG_UTF8 = 1 << 3, // UTF-8 support
908
909 k_FLAG_JIT = 1 << 4, // just-in-time compiling optimization
910 // requested
911
912 k_FLAG_DUPNAMES = 1 << 5 // allow duplicate named groups
913 // (sub-patterns)
914 };
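
// A minimal sketch of how the prepare-time flags combine, assuming local
// copies of the values listed above so that it compiles without the BDE
// headers.  The names 'k_ALL_PREPARE_FLAGS', 'isValidPrepareFlags', and
// 'hasFlag' are hypothetical helpers, not part of this component.

```cpp
// Local copies of the flag values shown in the enumeration above.
enum {
    k_FLAG_CASELESS      = 1 << 0,
    k_FLAG_DOTMATCHESALL = 1 << 1,
    k_FLAG_MULTILINE     = 1 << 2,
    k_FLAG_UTF8          = 1 << 3,
    k_FLAG_JIT           = 1 << 4,
    k_FLAG_DUPNAMES      = 1 << 5
};

// Mask with every defined prepare-time bit set (hypothetical helper).
const int k_ALL_PREPARE_FLAGS = k_FLAG_CASELESS | k_FLAG_DOTMATCHESALL |
                                k_FLAG_MULTILINE | k_FLAG_UTF8 |
                                k_FLAG_JIT | k_FLAG_DUPNAMES;

// Return 'true' if 'flags' is the bit-wise inclusive-or of zero or more of
// the defined values, and 'false' if any other bit is set.
bool isValidPrepareFlags(int flags)
{
    return 0 == (flags & ~k_ALL_PREPARE_FLAGS);
}

// Return 'true' if the single-bit 'flag' is set in 'flags'.
bool hasFlag(int flags, int flag)
{
    return 0 != (flags & flag);
}
```

// Because each flag occupies its own bit, 'k_FLAG_CASELESS | k_FLAG_UTF8'
// requests both behaviors at once, and a check such as 'isValidPrepareFlags'
// could guard the documented precondition in a debug build.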
915
916 enum {
917 // This enumeration defines the flags that may be supplied to 'replace'
918 // to affect specific replacement behavior.
919
920 k_REPLACE_LITERAL = 1 << 0, // the replacement string is literal
921
922 k_REPLACE_GLOBAL = 1 << 1, // replace all occurrences in the
923 // subject
924
925 k_REPLACE_EXTENDED = 1 << 2, // do extended replacement
926 // processing
927
928 k_REPLACE_UNKNOWN_UNSET = 1 << 3, // treat unknown group as unset
929
930 k_REPLACE_UNSET_EMPTY = 1 << 4 // simple unset insert = empty
931 // string
932 };
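
// To illustrate the difference that 'k_REPLACE_GLOBAL' makes, here is a small
// sketch that assumes 'std::regex' as a stand-in engine (it is not PCRE2 and
// its replacement syntax differs).  The function 'replaceSketch' and the
// local constant are hypothetical; only the flag value is taken from the
// enumeration above.

```cpp
#include <regex>
#include <string>

// Local copy of the value shown above, so the sketch is self-contained.
const unsigned k_REPLACE_GLOBAL = 1u << 1;

// Replace text in 'subject' that matches 'pattern' with 'replacement'.  If
// 'options' has 'k_REPLACE_GLOBAL' set, every match is replaced; otherwise
// only the first match is replaced.
std::string replaceSketch(const std::string& subject,
                          const std::string& pattern,
                          const std::string& replacement,
                          unsigned           options)
{
    const std::regex re(pattern);
    const std::regex_constants::match_flag_type mode =
        (options & k_REPLACE_GLOBAL)
            ? std::regex_constants::format_default
            : std::regex_constants::format_first_only;
    return std::regex_replace(subject, re, replacement, mode);
}
```

// With subject "banana", pattern "a", and replacement "o", the global form
// yields "bonono" and the non-global form yields "bonana".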
933
934 enum {
935 // Enumeration used to distinguish among results of match operations.
936
937 /// successful completion of the operation
939
940 /// the subject string did not match the pattern
942
943 /// `depthLimit()` was exceeded
945
946 /// memory available for the JIT stack is not large enough
947 /// (applicable only if `pattern()` was prepared with `k_FLAG_JIT`)
949
950 /// the UTF-8 string ends with a truncated UTF-8 character
952
953 /// the two most significant bits of the 2nd, 3rd or 4th byte of the
954 /// UTF-8 character do not have the binary value 0b10
956
957 /// a UTF-8 character is either 5 or 6 bytes long
959
960 /// a 4-byte UTF-8 character has a value greater than 0x10ffff
962
963 /// a 3-byte UTF-8 character has a value in the range 0xd800 to
964 /// 0xdfff
966
967 /// a 2-, 3- or 4-byte UTF-8 character is "overlong", i.e. it codes
968 /// for a value that can be represented by fewer bytes
970
971 /// the two most significant bits of the first byte of a UTF-8
972 /// character have the binary value 0b10
974
975 /// the first byte of a UTF-8 character has the value 0xfe or 0xff
977 };
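
// The UTF-8 failure conditions described above can be made concrete with a
// small validator.  This is only a sketch under stated assumptions: the
// enumerator names below are hypothetical (the actual 'k_STATUS_*' names are
// not reproduced here), and the order in which the checks are applied is a
// choice made for this example, not necessarily the order PCRE2 uses.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical result codes, one per condition described above.
enum class Utf8Check {
    Ok,
    Truncated,          // string ends inside a multi-byte character
    BadContinuation,    // 2nd, 3rd, or 4th byte does not start with 0b10
    FiveOrSixBytes,     // lead byte announces a 5- or 6-byte character
    TooLarge,           // 4-byte character with a value above 0x10ffff
    Surrogate,          // 3-byte character in the range 0xd800 to 0xdfff
    Overlong,           // value could be encoded in fewer bytes
    StrayContinuation,  // first byte starts with 0b10
    InvalidLeadByte     // first byte is 0xfe or 0xff
};

// Return 'Utf8Check::Ok' if 's' is valid UTF-8, and the code of the first
// problem found otherwise.
Utf8Check checkUtf8(const std::string& s)
{
    const std::size_t n = s.size();
    std::size_t       i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        if (lead < 0x80) { ++i; continue; }  // single-byte (ASCII)
        if (lead >= 0xfe)          return Utf8Check::InvalidLeadByte;
        if ((lead & 0xc0) == 0x80) return Utf8Check::StrayContinuation;
        if (lead >= 0xf8)          return Utf8Check::FiveOrSixBytes;

        const std::size_t len = lead >= 0xf0 ? 4 : lead >= 0xe0 ? 3 : 2;
        if (n - i < len) return Utf8Check::Truncated;

        // Keep the payload bits of the lead byte, then append 6 bits from
        // each continuation byte.
        std::uint32_t value = lead & (0xffu >> (len + 1));
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xc0) != 0x80) return Utf8Check::BadContinuation;
            value = (value << 6) | (c & 0x3fu);
        }

        // Smallest value that genuinely needs 'len' bytes.
        static const std::uint32_t minValue[] = {0, 0, 0x80, 0x800, 0x10000};
        if (value < minValue[len])               return Utf8Check::Overlong;
        if (value >= 0xd800 && value <= 0xdfff)  return Utf8Check::Surrogate;
        if (value > 0x10ffff)                    return Utf8Check::TooLarge;
        i += len;
    }
    return Utf8Check::Ok;
}
```

// For example, the two bytes 0xc0 0x80 encode the value 0 in an overlong
// form, and the three bytes 0xed 0xa0 0x80 encode the surrogate 0xd800.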
978
979 /// Value used to denote an invalid offset for match methods returning
980 /// pairs.
981 static const size_t k_INVALID_OFFSET;
982
983 // CLASS METHODS
984
985 /// Return the process-wide default evaluation recursion depth limit.
986 static int defaultDepthLimit();
987
988 /// Return `true` if just-in-time compiling optimization is supported by
989 /// current hardware platform and `false` otherwise. Note that JIT
990 /// support is limited to the following hardware platforms:
991 /// @code
992 /// ARM 32-bit (v5, v7, and Thumb2)
993 /// ARM 64-bit
994 /// Intel x86 32-bit and 64-bit
995 /// MIPS 32-bit and 64-bit
996 /// Power PC 32-bit and 64-bit
997 /// SPARC 32-bit
998 /// @endcode
999 static bool isJitAvailable();
1000
1001 /// Set the process-wide default evaluation recursion depth limit to the
1002 /// specified `depthLimit`. Return the previous depth limit.
1003 static int setDefaultDepthLimit(int depthLimit);
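
// The "set and return the previous value" contract can be sketched with
// 'std::atomic'.  This is an assumption-laden illustration, not the
// component's implementation: the variable and function names are
// hypothetical, and the initial value of 1000 is arbitrary.

```cpp
#include <atomic>

// Hypothetical stand-in for the process-wide default depth limit.  The
// initial value is arbitrary and chosen only for this example.
std::atomic<int> s_depthLimitSketch(1000);

// Return the current process-wide default depth limit.
int defaultDepthLimitSketch()
{
    return s_depthLimitSketch.load();
}

// Set the process-wide default to 'depthLimit' and return the previous
// value.  'exchange' performs the store and the read of the old value as a
// single atomic step, so concurrent callers never observe a torn update.
int setDefaultDepthLimitSketch(int depthLimit)
{
    return s_depthLimitSketch.exchange(depthLimit);
}
```

// Returning the previous value lets a caller restore it later, for example
// around a test that temporarily lowers the limit.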
1004
1005 // CREATORS
1006
1007 /// Create a regular-expression object in the "unprepared" state.
1008 /// Optionally specify a `basicAllocator` used to supply memory. The
1009 /// alignment strategy of the allocator must be "maximum" or "natural".
1010 /// If `basicAllocator` is 0, the currently installed default allocator
1011 /// is used.
1012 RegEx(bslma::Allocator *basicAllocator = 0); // IMPLICIT
1013
1014 /// Destroy this regular-expression object.
1015 ~RegEx();
1016
1017 // MANIPULATORS
1018
1019 /// Free resources used by this regular-expression object and put this
1020 /// object into the "unprepared" state. This method has no effect if
1021 /// this object is already in the "unprepared" state.
1022 void clear();
1023
1024 int prepare(bsl::nullptr_t errorMessage,
1025 size_t *errorOffset,
1026 const char *pattern,
1027 int flags = 0,
1028 size_t jitStackSize = 0);
1029
1030 /// Prepare this regular-expression object with the specified `pattern`
1031 /// and the optionally specified `flags`. `flags`, if supplied, should
1032 /// contain a bit-wise or of the `k_FLAG_*` constants defined by this
1033 /// class, which indicate additional configuration parameters for the
1034 /// regular expression. Optionally specify `jitStackSize`. If `flags`
1035 /// has the `k_FLAG_JIT` flag set, `jitStackSize` indicates the size of
1036 /// the allocated JIT stack to be used for this pattern. If `flags`
1037 /// has the `k_FLAG_JIT` bit set and `jitStackSize` is 0 (or not
1038 /// supplied), no memory will be allocated for the JIT stack and the
1039 /// program stack will be used as the JIT stack. If `flags` does not
1040 /// have `k_FLAG_JIT` set, or `isJitAvailable()` is `false`, the
1041 /// `jitStackSize` parameter, if supplied, is ignored. On success, put
1042 /// this object into the "prepared" state and return 0, with no effect
1043 /// on the specified `errorMessage` and `errorOffset`. Otherwise, (1)
1044 /// put this object into the "unprepared" state, (2) load `errorMessage`
1045 /// (if non-null) with a string describing the error detected, (3) load
1046 /// `errorOffset` (if non-null) with the offset in `pattern` at which
1047 /// the error was detected, and (4) return a non-zero value. The
1048 /// behavior is undefined unless `flags` is the bit-wise inclusive-or of
1049 /// 0 or more of the following values:
1050 /// @code
1051 /// k_FLAG_CASELESS
1052 /// k_FLAG_DOTMATCHESALL
1053 /// k_FLAG_MULTILINE
1054 /// k_FLAG_UTF8
1055 /// k_FLAG_JIT
1056 /// k_FLAG_DUPNAMES
1057 /// @endcode
1058 /// Note that the flag `k_FLAG_JIT` is ignored if `isJitAvailable()` is
1059 /// `false`.
1060 template <class STRING>
1063#ifdef BSLS_LIBRARYFEATURES_HAS_CPP17_PMR_STRING
1065#endif
1066 , int>::type
1067 prepare(STRING *errorMessage,
1068 size_t *errorOffset,
1069 const char *pattern,
1070 int flags = 0,
1071 size_t jitStackSize = 0);
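
// The rules above for 'k_FLAG_JIT' and 'jitStackSize' can be summarized as a
// small decision function.  This is a sketch of the documented behavior only;
// 'JitStackChoice' and 'jitStackChoice' are hypothetical names, and
// 'jitAvailable' stands in for the result of 'isJitAvailable()'.

```cpp
#include <cstddef>

// Local copy of the flag value listed in the enumeration above.
const int k_FLAG_JIT = 1 << 4;

// Hypothetical description of which stack JIT-compiled code would use.
enum JitStackChoice {
    e_JIT_NOT_USED,     // 'jitStackSize', if supplied, is ignored
    e_PROGRAM_STACK,    // no JIT stack is allocated
    e_ALLOCATED_STACK   // a JIT stack of 'jitStackSize' bytes is allocated
};

// Return the stack choice implied by 'flags', 'jitStackSize', and
// 'jitAvailable', following the prose description of 'prepare'.
JitStackChoice jitStackChoice(int         flags,
                              std::size_t jitStackSize,
                              bool        jitAvailable)
{
    if (!(flags & k_FLAG_JIT) || !jitAvailable) {
        return e_JIT_NOT_USED;
    }
    return 0 == jitStackSize ? e_PROGRAM_STACK : e_ALLOCATED_STACK;
}
```

// Note that a non-zero 'jitStackSize' has no effect unless 'k_FLAG_JIT' is
// set and JIT is available on the platform.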
1072
1073 /// Set the evaluation recursion depth limit for this regular-expression
1074 /// object to the specified `depthLimit`. Return the previous depth
1075 /// limit.
1077
1078 // ACCESSORS
1079
1080 /// Return the evaluation recursion depth limit for this
1081 /// regular-expression object.
1082 int depthLimit() const;
1083
1084 /// Return the flags that were supplied to the most recent successful
1085 /// call to the `prepare` method of this regular-expression object. The
1086 /// behavior is undefined unless `isPrepared() == true`. Note that the
1087 /// returned value will be the bit-wise inclusive-or of 0 or more of the
1088 /// following values:
1089 /// @code
1090 /// k_FLAG_CASELESS
1091 /// k_FLAG_DOTMATCHESALL
1092 /// k_FLAG_MULTILINE
1093 /// k_FLAG_UTF8
1094 /// k_FLAG_JIT
1095 /// k_FLAG_DUPNAMES
1096 /// @endcode
1097 /// Also note that `k_FLAG_JIT` is ignored, but still returned by this
1098 /// method, if `isJitAvailable()` is `false`.
1099 int flags() const;
1100
1101 /// Return `true` if this regular-expression object is in the "prepared"
1102 /// state, and `false` otherwise.
1103 bool isPrepared() const;
1104
1105 /// Return the size of the dynamically allocated JIT stack if it has
1106 /// been specified explicitly with the `prepare` method. Return 0 if a
1107 /// zero `jitStackSize` value was passed to the `prepare` method (or not
1108 /// supplied at all) or if `isPrepared()` is `false`.
1109 size_t jitStackSize() const;
1110
1111 /// Match the specified `subject` against `pattern()`. Begin matching
1112 /// at the optionally specified `subjectOffset` in `subject`. If
1113 /// `subjectOffset` is not specified, matching begins at the start of
1114 /// `subject`. UTF-8 validity checking is performed on `subject` if
1115 /// `pattern()` was prepared with `k_FLAG_UTF8`. Return
1116 /// `k_STATUS_SUCCESS` on success, `k_STATUS_NO_MATCH` if a match is not
1117 /// found, and another value if an error occurs. If the returned status
1118 /// is not `k_STATUS_SUCCESS` or `k_STATUS_NO_MATCH` it may match one of
1119 /// specific `k_STATUS_*` error return constants defined above (but is
1120 /// not guaranteed to). The behavior is undefined unless
1121 /// `true == isPrepared()` and `subjectOffset <= subject.length()`.
1122 /// Note that JIT optimization is disabled if `pattern()` was prepared
1123 /// with `k_FLAG_UTF8`; use `matchRaw` if JIT is preferred and UTF-8
1124 /// validation of `subject` is not required.
1125 int match(const bsl::string_view& subject,
1126 size_t subjectOffset = 0) const;
1127
1128 /// Match the specified `subject` having the specified `subjectLength`
1129 /// against `pattern()`. Begin matching at the optionally specified
1130 /// `subjectOffset` in `subject`. If `subjectOffset` is not specified,
1131 /// matching begins at the start of `subject`. `subject` may contain
1132 /// embedded null characters. UTF-8 validity checking is performed on
1133 /// `subject` if `pattern()` was prepared with `k_FLAG_UTF8`. Return
1134 /// `k_STATUS_SUCCESS` on success, `k_STATUS_NO_MATCH` if a match is not
1135 /// found, and another value if an error occurs. If the returned status
1136 /// is not `k_STATUS_SUCCESS` or `k_STATUS_NO_MATCH` it may match one of
1137 /// specific `k_STATUS_*` error return constants defined above (but is
1138 /// not guaranteed to). The behavior is undefined unless
1139 /// `true == isPrepared()`, `subject || 0 == subjectLength`, and
1140 /// `subjectOffset <= subjectLength`. Note that JIT optimization is
1141 /// disabled if `pattern()` was prepared with `k_FLAG_UTF8`; use
1142 /// `matchRaw` if JIT is preferred and UTF-8 validation of `subject` is
1143 /// not required.
1144 int match(const char *subject,
1145 size_t subjectLength,
1146 size_t subjectOffset = 0) const;
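
// The preconditions on 'subject', 'subjectLength', and 'subjectOffset' can
// be checked by a caller before invoking 'match'.  A minimal sketch follows;
// the helper name 'isValidMatchInput' is hypothetical and is not provided by
// this component.

```cpp
#include <cstddef>

// Return 'true' if the arguments satisfy the documented preconditions
// 'subject || 0 == subjectLength' and 'subjectOffset <= subjectLength'.
bool isValidMatchInput(const char  *subject,
                       std::size_t  subjectLength,
                       std::size_t  subjectOffset)
{
    return (subject || 0 == subjectLength) && subjectOffset <= subjectLength;
}
```

// A null 'subject' is therefore acceptable only for an empty subject, and an
// offset equal to 'subjectLength' (the end of the subject) is still in
// contract.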
1147
1148 /// Match the specified `subject` having the specified `subjectLength`
1149 /// against `pattern()`. Begin matching at the optionally specified
1150 /// `subjectOffset` in `subject`. If `subjectOffset` is not specified,
1151 /// matching begins at the start of `subject`. `subject` may contain
1152 /// embedded null characters. UTF-8 validity checking is performed on
1153 /// `subject` if `pattern()` was prepared with `k_FLAG_UTF8`. Return
1154 /// `k_STATUS_SUCCESS` on success, `k_STATUS_NO_MATCH` if a match is not
1155 /// found, and another value if an error occurs. If the returned status
1156 /// is not `k_STATUS_SUCCESS` or `k_STATUS_NO_MATCH` it may match one of
1157 /// specific `k_STATUS_*` error return constants defined above (but is
1158 /// not guaranteed to). `result` is unchanged if a value other than
1159 /// `k_STATUS_SUCCESS` is returned. The behavior is undefined unless
1160 /// `true == isPrepared()`, `subject || 0 == subjectLength`, and
1161 /// `subjectOffset <= subjectLength`. Note that JIT optimization is
1162 /// disabled if `pattern()` was prepared with `k_FLAG_UTF8`; use
1163 /// `matchRaw` if JIT is preferred and UTF-8 validation of `subject` is
1164 /// not required.
1166 const char *subject,
1167 size_t subjectLength,
1168 size_t subjectOffset = 0) const;
1170 const char *subject,
1171 size_t subjectLength,
1172 size_t subjectOffset = 0) const;
1173
1174 /// Match the specified `subject` against `pattern()`. Begin matching
1175 /// at the optionally specified `subjectOffset` in `subject`. If
1176 /// `subjectOffset` is not specified, matching begins at the start of
1177 /// `subject`. UTF-8 validity checking is performed on `subject` if
1178 /// `pattern()` was prepared with `k_FLAG_UTF8`. Return
1179 /// `k_STATUS_SUCCESS` on success, `k_STATUS_NO_MATCH` if a match is not
1180 /// found, and another value if an error occurs. If the returned status
1181 /// is not `k_STATUS_SUCCESS` or `k_STATUS_NO_MATCH` it may match one of
1182 /// specific `k_STATUS_*` error return constants defined above (but is
1183 /// not guaranteed to). `result` is unchanged if a value other than
1184 /// `k_STATUS_SUCCESS` is returned. The behavior is undefined unless
1185 /// `true == isPrepared()` and `subjectOffset <= subject.length()`.
1186 /// Note that JIT optimization is disabled if `pattern()` was prepared
1187 /// with `k_FLAG_UTF8`; use `matchRaw` if JIT is preferred and UTF-8
1188 /// validation of `subject` is not required.
1190 const bsl::string_view& subject,
1191 size_t subjectOffset = 0) const;
1192
1193 /// Match the specified `subject` having the specified `subjectLength`
1194 /// against `pattern()`. Begin matching at the optionally specified
1195 /// `subjectOffset` in `subject`. If `subjectOffset` is not specified,
1196 /// matching begins at the start of `subject`. `subject` may contain
1197 /// embedded null characters. UTF-8 validity checking is performed on
1198 /// `subject` if `pattern()` was prepared with `k_FLAG_UTF8`. On
1199 /// success:
1200 ///
1201 /// 1. Load the first element of the specified `result` with,
1202 /// respectively, a `(offset, length)` pair or a `bslstl::StringRef`
1203 /// indicating the leftmost match of `pattern()`.
1204 /// 2. Load elements of `result` in the range `[1 .. numSubpatterns()]`
1205 /// with, respectively, a `(offset, length)` pair or a
1206 /// `bslstl::StringRef` indicating the respective matches of
1207 /// sub-patterns (unmatched sub-patterns have their respective
1208 /// `result` elements loaded with either the `(k_INVALID_OFFSET, 0)`
1209 /// pair or an empty `bslstl::StringRef`); sub-patterns matching
1210 /// multiple times have their respective `result` elements loaded
1211 /// with the pairs or `bslstl::StringRef` indicating the rightmost
1212 /// match, and return `k_STATUS_SUCCESS`.
1213 ///
1214 /// Otherwise, return `k_STATUS_NO_MATCH` if a match is not found, and
1215 /// another value if an error occurs. If the returned status is not
1216 /// `k_STATUS_SUCCESS` or `k_STATUS_NO_MATCH` it may match one of
1217 /// specific `k_STATUS_*` error return constants defined above (but is
1218 /// not guaranteed to). `result` is unchanged if a value other than
1219 /// `k_STATUS_SUCCESS` is returned. The behavior is undefined unless
1220 /// `true == isPrepared()`, `subject || 0 == subjectLength`, and
1221 /// `subjectOffset <= subjectLength`. Note that JIT optimization is
1222 /// disabled if `pattern()` was prepared with `k_FLAG_UTF8`; use
1223 /// `matchRaw` if JIT is preferred and UTF-8 validation of `subject` is
1224 /// not required. Also note that after a successful call, `result` will
1225 /// contain exactly `numSubpatterns() + 1` elements.
1227 const char *subject,
1228 size_t subjectLength,
1229 size_t subjectOffset = 0)
1230 const;
1232 const char *subject,
1233 size_t subjectLength,
1234 size_t subjectOffset = 0)
1235 const;
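
// To show how a '(offset, length)' result might be consumed, the sketch
// below turns such pairs into strings.  It makes two assumptions that should
// be read as such: the sentinel 'k_INVALID_OFFSET_SKETCH' is a hypothetical
// stand-in (the actual value of 'k_INVALID_OFFSET' is not shown in this
// listing), and 'toStrings' is a hypothetical helper.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<std::size_t, std::size_t> OffsetLength;

// Assumed sentinel for an unmatched sub-pattern; hypothetical value.
const std::size_t k_INVALID_OFFSET_SKETCH = static_cast<std::size_t>(-1);

// Return one string per element of 'result': element 0 is the whole match,
// and element 'i' (for 'i >= 1') is the text matched by sub-pattern 'i'.
// Unmatched sub-patterns produce an empty string.
std::vector<std::string> toStrings(const std::string&               subject,
                                   const std::vector<OffsetLength>& result)
{
    std::vector<std::string> out;
    for (std::size_t i = 0; i < result.size(); ++i) {
        if (k_INVALID_OFFSET_SKETCH == result[i].first) {
            out.push_back(std::string());
        }
        else {
            out.push_back(subject.substr(result[i].first, result[i].second));
        }
    }
    return out;
}
```

// For instance, a hand-written result of '(0,7) (0,4) (5,2) (invalid,0)' for
// the subject "2024-01" corresponds to a whole match of "2024-01", two
// matched sub-patterns "2024" and "01", and one unmatched sub-pattern.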
1236
1238 const bsl::string_view& subject,
1239 size_t subjectOffset = 0) const;
1240 int match(std::vector<bsl::string_view> *result,
1241 const bsl::string_view& subject,
1242 size_t subjectOffset = 0) const;
1243#ifdef BSLS_LIBRARYFEATURES_HAS_CPP17_PMR
1244 int match(std::pmr::vector<bsl::string_view> *result,
1245 const bsl::string_view& subject,
1246 size_t subjectOffset = 0) const;
1247#endif
1248 // Match the specified `subject` against `pattern()`. Begin matching
1249 // at the optionally specified `subjectOffset` in `subject`. If
1250 // `subjectOffset` is not specified, matching begins at the start of
1251 // `subject`. UTF-8 validity checking is performed on `subject` if
1252 // `pattern()` was prepared with `k_FLAG_UTF8`. On success:
1253 //
1254 // 1. Load the first element of the specified `result` with a
1255 // `bsl::string_view` indicating the leftmost match of `pattern()`.
1256 //
1257 // 2. Load elements of `result` in the range `[1 .. numSubpatterns()]`
1258 // with a `bsl::string_view` indicating the respective matches of
1259 // sub-patterns (unmatched sub-patterns have their respective
1260 // `result` elements loaded with an empty `bsl::string_view`);
1261 // sub-patterns matching multiple times have their respective
1262 // `result` elements loaded with a `bsl::string_view` indicating the
1263 // rightmost match, and return `k_STATUS_SUCCESS`.
1264 //
1265 // Otherwise, return `k_STATUS_NO_MATCH` if a match is not found, and
1266 // another value if an error occurs. If the returned status is not
1267 // `k_STATUS_SUCCESS` or `k_STATUS_NO_MATCH` it may match one of
1268 // specific `k_STATUS_*` error return constants defined above (but is
1269 // not guaranteed to). `result` is unchanged if a value other than
1270 // `k_STATUS_SUCCESS` is returned. The behavior is undefined unless
1271 // `true == isPrepared()` and `subjectOffset <= subject.length()`. Note
1272 // that JIT optimization is disabled if `pattern()` was prepared with
1273 // `k_FLAG_UTF8`; use `matchRaw` if JIT is preferred and UTF-8
1274 // validation of `subject` is not required. Also note that after a
1275 // successful call, `result` will contain exactly
1276 // `numSubpatterns() + 1` elements.
1277
1278 /// Match the specified `subject` against `pattern()`. Begin matching
1279 /// at the optionally specified `subjectOffset` in `subject`. If
1280 /// `subjectOffset` is not specified, matching begins at the start of
1281 /// `subject`. Return `k_STATUS_SUCCESS` on success,
1282 /// `k_STATUS_NO_MATCH` if a match is not found, and another value if an
1283 /// error occurs. If the returned status is not `k_STATUS_SUCCESS` or
1284 /// `k_STATUS_NO_MATCH` it may match one of specific `k_STATUS_*` error
1285 /// return constants defined above (but is not guaranteed to). The
1286 /// behavior is undefined unless `true == isPrepared()`,
1287 /// `subjectOffset <= subject.length()`, and `subject` is valid UTF-8 if
1288 /// `pattern()` was prepared with `k_FLAG_UTF8`.
1289 int matchRaw(const bsl::string_view& subject,
1290 size_t subjectOffset = 0) const;
1291
1292 /// Match the specified `subject` having the specified `subjectLength`
1293 /// against `pattern()`. Begin matching at the optionally specified
1294 /// `subjectOffset` in `subject`. If `subjectOffset` is not specified,
1295 /// matching begins at the start of `subject`. `subject` may contain
1296 /// embedded null characters. Return `k_STATUS_SUCCESS` on success,
1297 /// `k_STATUS_NO_MATCH` if a match is not found, and another value if an
1298 /// error occurs. If the returned status is not `k_STATUS_SUCCESS` or
1299 /// `k_STATUS_NO_MATCH` it may match one of specific `k_STATUS_*` error
1300 /// return constants defined above (but is not guaranteed to). The
1301 /// behavior is undefined unless `true == isPrepared()`,
1302 /// `subject || 0 == subjectLength`, `subjectOffset <= subjectLength`,
1303 /// and `subject` is valid UTF-8 if `pattern()` was prepared with
1304 /// `k_FLAG_UTF8`.
1305 int matchRaw(const char *subject,
1306 size_t subjectLength,
1307 size_t subjectOffset = 0) const;
1308
1309 /// Match the specified `subject` having the specified `subjectLength`
1310 /// against `pattern()`. Begin matching at the optionally specified
1311 /// `subjectOffset` in `subject`. If `subjectOffset` is not specified,
1312 /// matching begins at the start of `subject`. `subject` may contain
1313 /// embedded null characters. Return `k_STATUS_SUCCESS` on success,
1314 /// `k_STATUS_NO_MATCH` if a match is not found, and another value if an
1315 /// error occurs. If the returned status is not `k_STATUS_SUCCESS` or
1316 /// `k_STATUS_NO_MATCH` it may match one of specific `k_STATUS_*` error
1317 /// return constants defined above (but is not guaranteed to). `result`
1318 /// is unchanged if a value other than `k_STATUS_SUCCESS` is returned.
1319 /// The behavior is undefined unless `true == isPrepared()`,
1320 /// `subject || 0 == subjectLength`, `subjectOffset <= subjectLength`,
1321 /// and `subject` is valid UTF-8 if `pattern()` was prepared with
1322 /// `k_FLAG_UTF8`.
1324 const char *subject,
1325 size_t subjectLength,
1326 size_t subjectOffset = 0) const;
1328 const char *subject,
1329 size_t subjectLength,
1330 size_t subjectOffset = 0) const;
1331
1332 /// Match the specified `subject` against `pattern()`. Begin matching
1333 /// at the optionally specified `subjectOffset` in `subject`. If
1334 /// `subjectOffset` is not specified, matching begins at the start of
1335 /// `subject`. Return `k_STATUS_SUCCESS` on success,
1336 /// `k_STATUS_NO_MATCH` if a match is not found, and another value if an
1337 /// error occurs. If the returned status is not `k_STATUS_SUCCESS` or
1338 /// `k_STATUS_NO_MATCH`, it may match one of the specific `k_STATUS_*` error
1339 /// return constants defined above (but is not guaranteed to). `result`
1340 /// is unchanged if a value other than `k_STATUS_SUCCESS` is returned.
1341 /// The behavior is undefined unless `true == isPrepared()`,
1342 /// `subjectOffset <= subject.length()`, and `subject` is valid UTF-8 if
1343 /// `pattern()` was prepared with `k_FLAG_UTF8`.
1344 int matchRaw(bsl::string_view *result,
1345 const bsl::string_view& subject,
1346 size_t subjectOffset = 0) const;
1347
1348 /// Match the specified `subject` having the specified `subjectLength`
1349 /// against `pattern()`. Begin matching at the optionally specified
1350 /// `subjectOffset` in `subject`. If `subjectOffset` is not specified,
1351 /// matching begins at the start of `subject`. `subject` may contain
1352 /// embedded null characters. On success:
1353 ///
1354 /// 1. Load the first element of the specified `result` with,
1355 /// respectively, a `(offset, length)` pair or a `bslstl::StringRef`
1356 /// indicating the leftmost match of `pattern()`.
1357 /// 2. Load elements of `result` in the range `[1 .. numSubpatterns()]`
1358 /// with, respectively, a `(offset, length)` pair or a
1359 /// `bslstl::StringRef` indicating the respective matches of
1360 /// sub-patterns (unmatched sub-patterns have their respective
1361 /// `result` elements loaded with either the `(k_INVALID_OFFSET, 0)`
1362 /// pair or an empty `bslstl::StringRef`); sub-patterns matching
1363 /// multiple times have their respective `result` elements loaded
1364 /// with the pairs or `bslstl::StringRef` indicating the rightmost
1365 /// match, and return `k_STATUS_SUCCESS`.
1366 ///
1367 /// Otherwise, return `k_STATUS_NO_MATCH` if a match is not found, and
1368 /// another value if an error occurs. If the returned status is not
1369 /// `k_STATUS_SUCCESS` or `k_STATUS_NO_MATCH`, it may match one of the
1370 /// specific `k_STATUS_*` error return constants defined above (but is
1371 /// not guaranteed to). `result` is unchanged if a value other than
1372 /// `k_STATUS_SUCCESS` is returned. The behavior is undefined unless
1373 /// `true == isPrepared()`, `subject || 0 == subjectLength`,
1374 /// `subjectOffset <= subjectLength`, and `subject` is valid UTF-8 if
1375 /// `pattern()` was prepared with `k_FLAG_UTF8`. Note that after a
1376 /// successful call, `result` will contain exactly
1377 /// `numSubpatterns() + 1` elements.
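1378 int matchRaw(bsl::vector<bsl::pair<size_t, size_t> > *result,
1379 const char *subject,
1380 size_t subjectLength,
1381 size_t subjectOffset = 0)
1382 const;
1383 int matchRaw(bsl::vector<bslstl::StringRef> *result,
1384 const char *subject,
1385 size_t subjectLength,
1386 size_t subjectOffset = 0)
1387 const;
1388
1389 int matchRaw(bsl::vector<bsl::string_view> *result,
1390 const bsl::string_view& subject,
1391 size_t subjectOffset = 0)
1392 const;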
1393 int matchRaw(std::vector<bsl::string_view> *result,
1394 const bsl::string_view& subject,
1395 size_t subjectOffset = 0)
1396 const;
1397#ifdef BSLS_LIBRARYFEATURES_HAS_CPP17_PMR
1398 int matchRaw(std::pmr::vector<bsl::string_view> *result,
1399 const bsl::string_view& subject,
1400 size_t subjectOffset = 0)
1401 const;
1402#endif
1403 // Match the specified 'subject' against 'pattern()'. Begin matching
1404 // at the optionally specified 'subjectOffset' in 'subject'. If
1405 // 'subjectOffset' is not specified, matching begins at the start of
1406 // 'subject'. On success:
1407 //
1408 //: 1 Load the first element of the specified 'result' with a
1409 //: 'bsl::string_view' indicating the leftmost match of 'pattern()'.
1410 //:
1411 //: 2 Load elements of 'result' in the range '[1 .. numSubpatterns()]'
1412 //: with a 'bsl::string_view' indicating the respective matches of
1413 //: sub-patterns (unmatched sub-patterns have their respective
1414 //: 'result' elements loaded with an empty 'bsl::string_view');
1415 //: sub-patterns matching multiple times have their respective
1416 //: 'result' elements loaded with a 'bsl::string_view' indicating the
1417 //: rightmost match, and return 'k_STATUS_SUCCESS'.
1418 //
1419 // Otherwise, return 'k_STATUS_NO_MATCH' if a match is not found, and
1420 // another value if an error occurs. If the returned status is not
1421 // 'k_STATUS_SUCCESS' or 'k_STATUS_NO_MATCH', it may match one of the
1422 // specific 'k_STATUS_*' error return constants defined above (but is
1423 // not guaranteed to). 'result' is unchanged if a value other than
1424 // 'k_STATUS_SUCCESS' is returned. The behavior is undefined unless
1425 // 'true == isPrepared()', 'subjectOffset <= subject.length()', and
1426 // 'subject' is valid UTF-8 if 'pattern()' was prepared with
1427 // 'k_FLAG_UTF8'. Also note that after a successful call, 'result'
1428 // will contain exactly 'numSubpatterns() + 1' elements.
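// Illustration only (not part of this component): the result-vector rules
// documented above (exactly 'numSubpatterns() + 1' elements, an empty entry
// for an unmatched sub-pattern, the rightmost match for a repeated one, and
// 'result' left unchanged on failure) can be observed with the standard
// library.  The sketch below uses 'std::regex' as a stand-in for 'RegEx',
// on the assumption that its capture semantics agree for these simple
// patterns; 'matchAllSketch' is a hypothetical helper, and it copies the
// matches into 'std::string' objects rather than loading views.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Load into the specified 'result' the full match of 're' in 'subject'
// followed by one entry per sub-pattern, and return 'true'.  Return 'false'
// and leave 'result' unchanged if there is no match.
bool matchAllSketch(std::vector<std::string> *result,
                    const std::regex&         re,
                    const std::string&        subject)
{
    assert(result);

    std::smatch m;
    if (!std::regex_search(subject, m, re)) {
        return false;  // no match: 'result' is not touched
    }

    std::vector<std::string> loaded;
    for (std::size_t i = 0; i < m.size(); ++i) {
        // An unmatched sub-pattern is represented by an empty string.
        loaded.push_back(m[i].matched ? m[i].str() : std::string());
    }
    result->swap(loaded);
    return true;
}
```

// With the pattern "(a)|(b)" and the subject "xxb", the sketch loads three
// entries: the full match "b", an empty entry for the unmatched first
// sub-pattern, and "b" for the second.  With "(\d)+" and "id 123", the
// sub-pattern entry holds the last repetition, "3".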
1429
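1430 void namedSubpatterns(
1431 bsl::vector<bsl::pair<bsl::string_view, int> > *result) const;
1432 void namedSubpatterns(
1433 std::vector<std::pair<bsl::string_view, int> > *result) const;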
1434#ifdef BSLS_LIBRARYFEATURES_HAS_CPP17_PMR
1435 void namedSubpatterns(
1436 std::pmr::vector<std::pair<bsl::string_view, int> > *result) const;
1437#endif
1438 // Load into the specified 'result' the mapping between the sub-pattern
1439 // names and their indices. The names are in alphabetical order. If
1440 // duplicate named groups were enabled for this regular expression (see
1441 // component documentation for {Allow Duplicate Named Groups
1442 // (sub-patterns)}), then a sub-pattern name may appear multiple times.
1443 // The behavior is undefined unless 'isPrepared()' is 'true'.
1444
1445 /// Return the number of sub-patterns in the pattern held by this
1446 /// regular-expression object (`pattern()`). The behavior is undefined
1447 /// unless `isPrepared() == true`.
1448 int numSubpatterns() const;
1449
1450 /// Return a reference to the non-modifiable pattern held by this
1451 /// regular-expression object. The behavior is undefined unless
1452 /// `isPrepared() == true`.
1453 const bsl::string& pattern() const;
1454
1455 int replace(bsl::string *result,
1456 int *errorOffset,
1457 const bsl::string_view& subject,
1458 const bsl::string_view& replacement,
1459 size_t options = 0) const;
1460 int replace(std::string *result,
1461 int *errorOffset,
1462 const bsl::string_view& subject,
1463 const bsl::string_view& replacement,
1464 size_t options = 0) const;
1465#ifdef BSLS_LIBRARYFEATURES_HAS_CPP17_PMR_STRING
1466 int replace(std::pmr::string *result,
1467 int *errorOffset,
1468 const bsl::string_view& subject,
1469 const bsl::string_view& replacement,
1470 size_t options = 0) const;
1471#endif
1472 // Replace parts of the specified 'subject' that are matched with the
1473 // specified 'replacement'. Optionally specify a bit mask of 'options'
1474 // flags that configure the behavior of the replacement. 'options'
1475 // should contain a bit-wise OR of the 'k_REPLACE_*' constants defined
1476 // by this class, which indicate additional configuration parameters
1477 // for the replacement. If 'options' has 'k_REPLACE_GLOBAL' flag then
1478 // this function iterates over 'subject', replacing every matching
1479 // substring. If 'k_REPLACE_GLOBAL' flag is not set, only the first
1480 // matching substring is replaced. UTF-8 validity checking is
1481 // performed on 'subject' and 'replacement' if 'pattern()' was prepared
1482 // with 'k_FLAG_UTF8'. Return the number of substitutions that were
1483 // carried out on success, and load the specified 'result' with the
1484 // result of the replacement. Otherwise, if an error occurs, return a
1485 // negative value. If that error is a syntax error in 'replacement',
1486 // load the specified 'errorOffset' (if non-null) with the offset in
1487 // 'replacement' where the error was detected; for other errors, such
1488 // as invalid 'subject' or 'replacement' UTF-8 string, load
1489 // 'errorOffset' with a negative value. The behavior is undefined
1490 // unless 'true == isPrepared()'. Note that if the size of 'result' is
1491 // too small to fit the resultant string then this method computes the
1492 // size of 'result' and adjusts it to the size that is needed. To
1493 // avoid automatic calculation and adjustment which may introduce a
1494 // performance penalty, it is recommended that the size of 'result' has
1495 // enough room to fit the resulting string including a zero-terminating
1496 // character.
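// Illustration only (not part of this component): the difference between
// replacing the first match and replacing every match, together with the
// returned substitution count, can be sketched with the standard library.
// 'replaceSketch' is a hypothetical helper built on 'std::regex'; it treats
// 'replacement' as literal text (in the spirit of 'k_REPLACE_LITERAL'), takes
// a 'bool' in place of the 'k_REPLACE_GLOBAL' bit, and does no UTF-8
// checking or error reporting.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Replace the first match of 're' in 'subject' (or every match if 'global'
// is 'true') with the literal 'replacement', load the outcome into the
// specified 'result', and return the number of substitutions made.
int replaceSketch(std::string        *result,
                  const std::regex&   re,
                  const std::string&  subject,
                  const std::string&  replacement,
                  bool                global)
{
    assert(result);

    std::string                 out;
    int                         count = 0;
    std::string::const_iterator pos   = subject.begin();

    for (std::sregex_iterator it(subject.begin(), subject.end(), re), end;
         it != end;
         ++it) {
        const std::smatch& m = *it;
        out.append(pos, m[0].first);  // copy the text preceding the match
        out.append(replacement);      // substitute the match itself
        pos = m[0].second;
        ++count;
        if (!global) {
            break;                    // only the first match is replaced
        }
    }
    out.append(pos, subject.end());   // copy the unmatched tail
    result->swap(out);
    return count;
}
```

// In this sketch, a subject with no match yields a count of 0 and a copy of
// the subject in 'result'.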
1497
1498 int replaceRaw(bsl::string *result,
1499 int *errorOffset,
1500 const bsl::string_view& subject,
1501 const bsl::string_view& replacement,
1502 size_t options = 0) const;
1503 int replaceRaw(std::string *result,
1504 int *errorOffset,
1505 const bsl::string_view& subject,
1506 const bsl::string_view& replacement,
1507 size_t options = 0) const;
1508#ifdef BSLS_LIBRARYFEATURES_HAS_CPP17_PMR_STRING
1509 int replaceRaw(std::pmr::string *result,
1510 int *errorOffset,
1511 const bsl::string_view& subject,
1512 const bsl::string_view& replacement,
1513 size_t options = 0) const;
1514#endif
1515 // Replace parts of the specified 'subject' that are matched with the
1516 // specified 'replacement'. Optionally specify a bit mask of 'options'
1517 // flags that configure the behavior of the replacement. 'options'
1518 // should contain a bit-wise OR of the 'k_REPLACE_*' constants defined
1519 // by this class, which indicate additional configuration parameters
1520 // for the replacement. If 'options' has 'k_REPLACE_GLOBAL' flag then
1521 // this function iterates over 'subject', replacing every matching
1522 // substring. If 'k_REPLACE_GLOBAL' flag is not set, only the first
1523 // matching substring is replaced. UTF-8 validity checking is
1524 // performed on 'subject' if 'pattern()' was prepared with
1525 // 'k_FLAG_UTF8'. Return the number of substitutions that were carried
1526 // out on success, and load the specified 'result' with the result of
1527 // the replacement. Otherwise, if an error occurs, return a negative
1528 // value. If that error is a syntax error in 'replacement', load the
1529 // specified 'errorOffset' (if non-null) with the offset in
1530 // 'replacement' where the error was detected; for other errors, such
1531 // as invalid 'subject' UTF-8 string, load 'errorOffset' with a
1532 // negative value. The behavior is undefined unless
1533 // 'true == isPrepared()'. Note that if the size of 'result' is too
1534 // small to fit the resultant string then this method computes the size
1535 // of 'result' and adjusts it to the size that is needed. To avoid
1536 // automatic calculation and adjustment which may introduce a
1537 // performance penalty, it is recommended that the size of 'result' has
1538 // enough room to fit the resulting string including a zero-terminating
1539 // character.
1540
1541 /// Return the 1-based index of the sub-pattern having the specified
1542 /// `name` in the pattern held by this regular-expression object
1543 /// (`pattern()`); return -1 if `pattern()` does not contain a
1544 /// sub-pattern identified by `name` or `name` is not unique. The
1545 /// behavior is undefined unless `isPrepared() == true`. Note that the
1546 /// returned value is intended to be used as an index into the
1547 /// `bsl::vector<bsl::pair<size_t, size_t> >` loaded by `match`. Also note
1548 /// that the function `namedSubpatterns` can be used to find the
1549 /// sub-pattern index when there are duplicate named sub-patterns.
1550 int subpatternIndex(const char *name) const;
1551};
1552
1553// ============================================================================
1554// INLINE DEFINITIONS
1555// ============================================================================
1556
1557 // -----------
1558 // class RegEx
1559 // -----------
1560
1561// CLASS METHODS
1562inline
1563int RegEx::defaultDepthLimit()
1564{
1565 return bsls::AtomicOperations::getIntRelaxed(&s_depthLimit);
1566}
1567
1568inline
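1569int RegEx::setDefaultDepthLimit(int depthLimit)
1570{
1571 int previous = defaultDepthLimit();
1572
1573 bsls::AtomicOperations::setIntRelaxed(&s_depthLimit, depthLimit);
1574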
1575 return previous;
1576}
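
// Illustration only (not part of this component): the get-previous-then-set
// shape of the two class methods above can be sketched with 'std::atomic'
// in place of 'bsls::AtomicOperations'.  All names ending in 'Sketch' and
// the initial value are hypothetical.

```cpp
#include <atomic>
#include <cassert>

namespace {
// Process-wide limit; the initial value here is an arbitrary placeholder.
std::atomic<int> s_depthLimitSketch(10000000);
}  // close unnamed namespace

// Return the current process-wide limit using a relaxed load.
int defaultDepthLimitSketch()
{
    return s_depthLimitSketch.load(std::memory_order_relaxed);
}

// Set the process-wide limit to the specified 'depthLimit' and return the
// value observed just before the store.
int setDefaultDepthLimitSketch(int depthLimit)
{
    int previous = defaultDepthLimitSketch();

    s_depthLimitSketch.store(depthLimit, std::memory_order_relaxed);

    return previous;
}
```

// Note that the load and the store are two separate operations, so two
// threads calling the setter concurrently may both observe the same
// 'previous' value.  If the returned value must be exact under contention,
// 'std::atomic<int>::exchange' performs both steps as one operation.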
1577
1578// CREATORS
1579inline
1580RegEx::~RegEx()
1581{
1582 clear();
1583 pcre2_compile_context_free(d_compileContext_p);
1584 pcre2_general_context_free(d_pcre2Context_p);
1585}
1586
1587// MANIPULATORS
1588template <class STRING>
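1589typename bsl::enable_if< bsl::is_same<STRING, bsl::string>::value
1590 || bsl::is_same<STRING, std::string>::value
1591#ifdef BSLS_LIBRARYFEATURES_HAS_CPP17_PMR_STRING
1592 || bsl::is_same<STRING, std::pmr::string>::value
1593#endif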
1594 , int>::type
1595RegEx::prepare(STRING *errorMessage,
1596 size_t *errorOffset,
1597 const char *pattern,
1598 int flags,
1599 size_t jitStackSize)
1600{
1601 const int k_BUFFER_LEN = 256;
1602 char buffer[k_BUFFER_LEN] = {0};
1603 size_t offset;
1604
1605 int ret = prepareImp(&buffer[0],
1606 k_BUFFER_LEN - 1,
1607 &offset,
1608 pattern,
1609 flags,
1610 jitStackSize);
1611
1612 if (ret) {
1613 if (errorMessage) {
1614 errorMessage->assign(&buffer[0]);
1615 }
1616 if (errorOffset) {
1617 *errorOffset = offset;
1618 }
1619 }
1620
1621 return ret;
1622}
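
// Illustration only (not part of this component): the error-reporting shape
// of 'prepare' above (a fixed-size stack buffer filled by the
// implementation, then copied into whichever string type the caller
// supplied, with both output arguments optional and written only on
// failure) can be sketched without the library.  'prepareImpSketch' is a
// hypothetical stand-in for 'prepareImp' that merely rejects a pattern
// ending in a backslash; it is a toy check, not a pattern compiler.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <string>

// Toy stand-in: return 0 unless 'pattern' ends in a backslash, in which case
// write a message to 'buffer', load 'offset', and return a non-zero value.
int prepareImpSketch(char        *buffer,
                     std::size_t  bufferLength,
                     std::size_t *offset,
                     const char  *pattern)
{
    std::size_t length = std::strlen(pattern);
    if (0 == length || '\\' != pattern[length - 1]) {
        return 0;
    }
    std::snprintf(buffer, bufferLength, "dangling backslash");
    *offset = length;
    return -1;
}

// Forward to the stand-in and, on failure only, load the optionally
// specified 'errorMessage' and 'errorOffset' (either may be null).
template <class STRING>
int prepareSketch(STRING      *errorMessage,
                  std::size_t *errorOffset,
                  const char  *pattern)
{
    const int   k_BUFFER_LEN         = 256;
    char        buffer[k_BUFFER_LEN] = {0};
    std::size_t offset               = 0;

    int ret = prepareImpSketch(buffer, k_BUFFER_LEN - 1, &offset, pattern);

    if (ret) {
        if (errorMessage) {
            errorMessage->assign(buffer);
        }
        if (errorOffset) {
            *errorOffset = offset;
        }
    }
    return ret;
}
```

// Because the message is staged in a plain 'char' buffer, the same body
// works for any string type that provides 'assign(const char *)'.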
1623
1624// ACCESSORS
1625inline
1626int RegEx::depthLimit() const
1627{
1628 return d_depthLimit;
1629}
1630
1631inline
1632int RegEx::flags() const
1633{
1634 return d_flags;
1635}
1636
1637inline
1638bool RegEx::isPrepared() const
1639{
1640 return (0 != d_patternCode_p);
1641}
1642
1643inline
1644size_t RegEx::jitStackSize() const
1645{
1646 return d_jitStackSize;
1647}
1648
1649inline
1650const bsl::string& RegEx::pattern() const
1651{
1652 return d_pattern;
1653}
1654
1655} // close package namespace
1656
1657
1658
1659#endif
1660
1661// ----------------------------------------------------------------------------
1662// Copyright 2016 Bloomberg Finance L.P.
1663//
1664// Licensed under the Apache License, Version 2.0 (the "License");
1665// you may not use this file except in compliance with the License.
1666// You may obtain a copy of the License at
1667//
1668// http://www.apache.org/licenses/LICENSE-2.0
1669//
1670// Unless required by applicable law or agreed to in writing, software
1671// distributed under the License is distributed on an "AS IS" BASIS,
1672// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
1673// See the License for the specific language governing permissions and
1674// limitations under the License.
1675// ----------------------------- END-OF-FILE ----------------------------------
1676
1677/** @} */
1678/** @} */
1679/** @} */