#include <bdljsn_tokenizer.h>

Public Types
enum	TokenType { e_BEGIN = 1 , e_ELEMENT_NAME , e_START_OBJECT , e_END_OBJECT , e_START_ARRAY , e_END_ARRAY , e_ELEMENT_VALUE , e_ERROR , BAEJSN_ELEMENT_NAME = e_ELEMENT_NAME , BAEJSN_START_OBJECT = e_START_OBJECT , BAEJSN_END_OBJECT = e_END_OBJECT , BAEJSN_START_ARRAY = e_START_ARRAY , BAEJSN_END_ARRAY = e_END_ARRAY , BAEJSN_ELEMENT_VALUE = e_ELEMENT_VALUE , BAEJSN_ERROR = e_ERROR }

enum	{ k_EOF = +1 }

enum	ConformanceMode { e_RELAXED = 0 , e_STRICT_20240119 }

typedef bsls::Types::IntPtr	IntPtr

typedef bsls::Types::Uint64	Uint64

Public Member Functions
	Tokenizer (bslma::Allocator *basicAllocator=0)

	~Tokenizer ()
	Destroy this object.

int	advanceToNextToken ()

void	reset (bsl::streambuf *streambuf)

int	resetStreamBufGetPointer ()

Tokenizer &	setAllowConsecutiveSeparators (bool value)

Tokenizer &	setAllowFormFeedAsWhitespace (bool value)

Tokenizer &	setAllowHeterogenousArrays (bool value)

Tokenizer &	setAllowNonUtf8StringLiterals (bool value)

Tokenizer &	setAllowStandAloneValues (bool value)

Tokenizer &	setAllowTrailingTopLevelComma (bool value)

Tokenizer &	setAllowUnescapedControlCharacters (bool value)

Tokenizer &	setConformanceMode (ConformanceMode mode)

bool	allowConsecutiveSeparators () const

bool	allowFormFeedAsWhitespace () const

bool	allowHeterogenousArrays () const

bool	allowNonUtf8StringLiterals () const

bool	allowStandAloneValues () const

bool	allowTrailingTopLevelComma () const

bool	allowUnescapedControlCharacters () const

ConformanceMode	conformanceMode () const
	Return the `conformanceMode` of this tokenizer.

bsls::Types::Uint64	currentPosition () const

bsls::Types::Uint64	readOffset () const

int	readStatus () const

TokenType	tokenType () const
	Return the token type of the current token.

int	value (bsl::string_view *data) const

Detailed Description

This class provides a mechanism for traversing JSON data stored in a bsl::streambuf one node at a time and allows clients to access the data associated with that node, including its type and data value.

See bdljsn_tokenizer

Member Typedef Documentation

◆ IntPtr

typedef bsls::Types::IntPtr bdljsn::Tokenizer::IntPtr

◆ Uint64

typedef bsls::Types::Uint64 bdljsn::Tokenizer::Uint64

Member Enumeration Documentation

◆ anonymous enum

anonymous enum

Enumerator
k_EOF

◆ ConformanceMode

enum bdljsn::Tokenizer::ConformanceMode

Enumerator
e_RELAXED
e_STRICT_20240119

◆ TokenType

enum bdljsn::Tokenizer::TokenType

Enumerator
e_BEGIN
e_ELEMENT_NAME
e_START_OBJECT
e_END_OBJECT
e_START_ARRAY
e_END_ARRAY
e_ELEMENT_VALUE
e_ERROR
BAEJSN_ELEMENT_NAME
BAEJSN_START_OBJECT
BAEJSN_END_OBJECT
BAEJSN_START_ARRAY
BAEJSN_END_ARRAY
BAEJSN_ELEMENT_VALUE
BAEJSN_ERROR

Constructor & Destructor Documentation

◆ Tokenizer()

bdljsn::Tokenizer::Tokenizer ( bslma::Allocator * basicAllocator = 0 )

inline explicit

Create a Tokenizer object. Optionally specify a basicAllocator used to supply memory. If basicAllocator is 0, the currently installed default allocator is used. By default, the conformanceMode is e_RELAXED and the values of the Tokenizer options are:

allowConsecutiveSeparators()      == true;
allowFormFeedAsWhitespace()       == true;
allowHeterogenousArrays()         == true;
allowNonUtf8StringLiterals()      == true;
allowStandAloneValues()           == true;
allowTrailingTopLevelComma()      == true;
allowUnescapedControlCharacters() == true;

The reset method must be called before any calls to advanceToNextToken or resetStreamBufGetPointer.

◆ ~Tokenizer()

bdljsn::Tokenizer::~Tokenizer ( )

inline

Destroy this object.

Member Function Documentation

◆ advanceToNextToken()

int bdljsn::Tokenizer::advanceToNextToken ( )

Move to the next token in the data stream. Return 0 on success and a non-zero value otherwise. Each call to advanceToNextToken invalidates the string references returned by the value accessor for prior nodes. This function may fail to move to the next token if doing so would advance past a character sequence that is not valid JSON, and is guaranteed to do so (fail to move) if e_RELAXED != conformanceMode(). The behavior is undefined unless reset has been called.

◆ allowConsecutiveSeparators()

bool bdljsn::Tokenizer::allowConsecutiveSeparators ( ) const

inline

Return the value of the allowConsecutiveSeparators option of this tokenizer.

◆ allowFormFeedAsWhitespace()

bool bdljsn::Tokenizer::allowFormFeedAsWhitespace ( ) const

inline

Return the value of the allowFormFeedAsWhitespace option of this tokenizer.

◆ allowHeterogenousArrays()

bool bdljsn::Tokenizer::allowHeterogenousArrays ( ) const

inline

Return the value of the allowHeterogenousArrays option of this tokenizer.

◆ allowNonUtf8StringLiterals()

bool bdljsn::Tokenizer::allowNonUtf8StringLiterals ( ) const

inline

Return the value of the allowNonUtf8StringLiterals option of this tokenizer.

◆ allowStandAloneValues()

bool bdljsn::Tokenizer::allowStandAloneValues ( ) const

inline

Return the value of the allowStandAloneValues option of this tokenizer.

◆ allowTrailingTopLevelComma()

bool bdljsn::Tokenizer::allowTrailingTopLevelComma ( ) const

inline

Return the value of the allowTrailingTopLevelComma option of this tokenizer.

◆ allowUnescapedControlCharacters()

bool bdljsn::Tokenizer::allowUnescapedControlCharacters ( ) const

inline

Return the value of the allowUnescapedControlCharacters option of this tokenizer.

◆ conformanceMode()

Tokenizer::ConformanceMode bdljsn::Tokenizer::conformanceMode ( ) const

inline

Return the conformanceMode of this tokenizer.

◆ currentPosition()

bsls::Types::Uint64 bdljsn::Tokenizer::currentPosition ( ) const

inline

Return the offset of the current octet being tokenized in the stream supplied to reset, or if an error occurred, the position where the failed attempt to tokenize a token occurred. Note that this operation is intended to provide additional information in the case of an error.
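When reporting an error, a raw offset is often easier to act on once it is translated into a line and column. The following is a minimal sketch of such a translation, assuming the value returned by currentPosition is a zero-based byte offset into text the caller still holds; LineColumn and toLineColumn are hypothetical names and are not part of bdljsn.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

// Hypothetical helper type (not part of bdljsn): a 1-based line and column.
struct LineColumn {
    std::size_t line;
    std::size_t column;
};

// Convert the specified zero-based byte 'offset' into 'text' to a 1-based
// line/column pair.  Offsets past the end of 'text' are clamped to its
// length.
inline LineColumn toLineColumn(std::string_view text, std::uint64_t offset)
{
    LineColumn        result{1, 1};
    const std::size_t end = offset < text.size()
                          ? static_cast<std::size_t>(offset)
                          : text.size();

    for (std::size_t i = 0; i < end; ++i) {
        if (text[i] == '\n') {
            ++result.line;      // a newline starts the next line
            result.column = 1;
        }
        else {
            ++result.column;
        }
    }
    return result;
}
```

A caller would pass the original JSON text together with the offset obtained after a failed advanceToNextToken call.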

◆ readOffset()

bsls::Types::Uint64 bdljsn::Tokenizer::readOffset ( ) const

inline

Return the last read position relative to when reset was called. Note that readOffset() >= currentPosition() – the readOffset is the offset of the last octet read from the stream supplied to reset, and is at or beyond the current position being tokenized.

◆ readStatus()

int bdljsn::Tokenizer::readStatus ( ) const

inline

Return the status of the last call to reloadStringBuffer():

0 if reloadStringBuffer() has not been called or if a token was successfully read.
k_EOF (which is positive) if no data could be read before reaching EOF.
a negative value if the allowNonUtf8StringLiterals option is false and a UTF-8 error occurred. The specific value returned will be one of the enumerators of the bdlde::Utf8Util::ErrorStatus enum type indicating the nature of the UTF-8 error.
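The three cases above can be told apart with a small amount of branching. This sketch is standalone: k_EOF mirrors the documented enumerator value (+1), and describeReadStatus is a hypothetical helper, not a bdljsn function. It does not decode the specific bdlde::Utf8Util::ErrorStatus value.

```cpp
#include <string>

// Mirrors the value documented for 'bdljsn::Tokenizer::k_EOF' (+1).
constexpr int k_EOF = 1;

// Hypothetical helper: return a short label for the specified 'status', a
// value obtained from 'readStatus()'.
inline std::string describeReadStatus(int status)
{
    if (0 == status) {
        return "ok";            // nothing read yet, or a token was read
    }
    if (k_EOF == status) {
        return "eof";           // no data could be read before EOF
    }
    if (status < 0) {
        return "utf8-error";    // one of the UTF-8 error enumerators
    }
    return "unknown";           // not a documented value
}
```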

◆ reset()

void bdljsn::Tokenizer::reset ( bsl::streambuf * streambuf )

inline

Reset this tokenizer to read data from the specified streambuf. Note that the reader will not be on a valid node until advanceToNextToken is called. Note that this function does not change the conformanceMode nor the values of any of the individual token options:

allowConsecutiveSeparators
allowFormFeedAsWhitespace
allowHeterogenousArrays
allowNonUtf8StringLiterals
allowStandAloneValues
allowTrailingTopLevelComma
allowUnescapedControlCharacters

◆ resetStreamBufGetPointer()

int bdljsn::Tokenizer::resetStreamBufGetPointer ( )

Reset the get pointer of the streambuf held by this object to refer to the byte following the last processed byte, if the held streambuf supports seeking, and return an error otherwise leaving this object unchanged. Return 0 on success, and a non-zero value otherwise. The behavior is undefined unless reset has been called. Note that after a successful function return users can read data from the streambuf that was specified during reset from where this object stopped. Also note that this call implies the end of processing for this object and any subsequent methods invoked on this object should only be done after calling reset and specifying a new streambuf.
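The underlying idea, repositioning a stream buffer's get pointer and then continuing to read from it, can be shown with the standard library alone. The sketch below does not call the tokenizer; readRemainder is a hypothetical helper that seeks a std::streambuf to a given offset and reports failure when the buffer cannot seek there.

```cpp
#include <ios>
#include <iterator>
#include <optional>
#include <sstream>
#include <streambuf>
#include <string>

// Hypothetical helper: position the get pointer of the specified 'buf' at
// the specified 'offset' and return everything from there to the end, or
// 'std::nullopt' if 'buf' cannot seek to that position.
inline std::optional<std::string> readRemainder(std::streambuf& buf,
                                                std::streamoff  offset)
{
    typedef std::streambuf::pos_type Pos;
    typedef std::streambuf::off_type Off;

    // 'pubseekpos' reports failure by returning 'pos_type(off_type(-1))'.
    if (buf.pubseekpos(Pos(offset), std::ios_base::in) == Pos(Off(-1))) {
        return std::nullopt;
    }
    return std::string(std::istreambuf_iterator<char>(&buf),
                       std::istreambuf_iterator<char>());
}
```

With a std::stringbuf holding a 7-byte JSON document followed by other data, seeking to offset 7 yields the data that follows the document.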

◆ setAllowConsecutiveSeparators()

Tokenizer & bdljsn::Tokenizer::setAllowConsecutiveSeparators ( bool value )

inline

Set the allowConsecutiveSeparators option to the specified value and return a non-const reference to this tokenizer. JSON defines two separator tokens: the colon (:) and the comma (,). If the allowConsecutiveSeparators value is true this tokenizer will accept multiple consecutive sequences of a given separator (e.g., "a"::b, "c":::d and "a":b,, "c":d, ,, "e":f) as if a single separator had appeared (i.e., "a":b, "c":d and "a":b, "c":d, "e":f, respectively). Otherwise the tokenizer returns an error when multiple consecutive colons are found. By default, the value of the allowConsecutiveSeparators option is true. The behavior is undefined unless e_RELAXED == conformanceMode(). Note that a consecutive sequence using both tokens (e.g., ::,,::) is always an error.
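The rule can be modeled as a classification of the run of separators found between two tokens. This is a standalone sketch, not the library's implementation: SeparatorRun and classifySeparatorRun are hypothetical names, and the sketch assumes that a repeated comma is rejected in the same way as a repeated colon when the option is false.

```cpp
#include <string_view>

// Hypothetical result codes used by this sketch only.
enum class SeparatorRun { e_NONE, e_COLON, e_COMMA, e_ERROR };

// Classify the specified 'run', a stretch of input expected to hold only
// separators and whitespace.  If 'allowConsecutive' is 'true', repeats of
// one separator collapse to a single separator; otherwise any repeat is an
// error.  A run mixing ':' and ',' is always an error.
inline SeparatorRun classifySeparatorRun(std::string_view run,
                                         bool             allowConsecutive)
{
    int colons = 0;
    int commas = 0;

    for (char c : run) {
        if (':' == c) {
            ++colons;
        }
        else if (',' == c) {
            ++commas;
        }
        else if (' ' != c && '\t' != c && '\n' != c && '\r' != c) {
            return SeparatorRun::e_ERROR;   // not a separator run at all
        }
    }

    if (colons > 0 && commas > 0) {
        return SeparatorRun::e_ERROR;       // e.g., "::,,::"
    }
    if (0 == colons + commas) {
        return SeparatorRun::e_NONE;
    }
    if (colons + commas > 1 && !allowConsecutive) {
        return SeparatorRun::e_ERROR;
    }
    return colons > 0 ? SeparatorRun::e_COLON : SeparatorRun::e_COMMA;
}
```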

◆ setAllowFormFeedAsWhitespace()

Tokenizer & bdljsn::Tokenizer::setAllowFormFeedAsWhitespace ( bool value )

inline

Set the allowFormFeedAsWhitespace option to the specified value and return a non-const reference to this tokenizer. If the allowFormFeedAsWhitespace value is true the formfeed character ('\f') is recognized as a whitespace character in addition to '\n', '\t', '\r', and '\v'. Otherwise, formfeed is disallowed as whitespace.
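As a rough model of the rule, the option only changes the answer for a single character. The predicate below is a hypothetical illustration rather than the tokenizer's code; treating the space character as whitespace is an assumption of the sketch.

```cpp
// Hypothetical predicate: return 'true' if the specified 'c' counts as
// whitespace under the documented rule, where 'allowFormFeed' stands in
// for the value of the 'allowFormFeedAsWhitespace' option.
inline bool isJsonWhitespace(char c, bool allowFormFeed)
{
    switch (c) {
      case ' ':     // assumption of this sketch
      case '\n':
      case '\t':
      case '\r':
      case '\v':
        return true;
      case '\f':
        return allowFormFeed;   // the only character the option affects
      default:
        return false;
    }
}
```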

◆ setAllowHeterogenousArrays()

Tokenizer & bdljsn::Tokenizer::setAllowHeterogenousArrays ( bool value )

inline

Set the allowHeterogenousArrays option to the specified value and return a non-const reference to this tokenizer. If the allowHeterogenousArrays value is true this tokenizer will successfully tokenize heterogeneous values within an array. If the option's value is false then the tokenizer will return an error for arrays having heterogeneous values. By default, the value of the allowHeterogenousArrays option is true. The behavior is undefined unless e_RELAXED == conformanceMode().

◆ setAllowNonUtf8StringLiterals()

Tokenizer & bdljsn::Tokenizer::setAllowNonUtf8StringLiterals ( bool value )

inline

Set the allowNonUtf8StringLiterals option to the specified value and return a non-const reference to this tokenizer. If the allowNonUtf8StringLiterals value is false this tokenizer will check string literal tokens for invalid UTF-8, enter an error mode if it encounters a string literal token that has any content that is not UTF-8, and fail to advance to subsequent tokens until reset is called. By default, the value of the allowNonUtf8StringLiterals option is true. The behavior is undefined unless e_RELAXED == conformanceMode().

◆ setAllowStandAloneValues()

Tokenizer & bdljsn::Tokenizer::setAllowStandAloneValues ( bool value )

inline

Set the allowStandAloneValues option to the specified value and return a non-const reference to this tokenizer. If the allowStandAloneValues value is true this tokenizer will successfully tokenize JSON values (strings and numbers). If the option's value is false then the tokenizer will only tokenize complete JSON documents (JSON objects and arrays) and return an error for stand alone JSON values. By default, the value of the allowStandAloneValues option is true. The behavior is undefined unless e_RELAXED == conformanceMode().

◆ setAllowTrailingTopLevelComma()

Tokenizer & bdljsn::Tokenizer::setAllowTrailingTopLevelComma ( bool value )

inline

Set the allowTrailingTopLevelComma option to the specified value and return a non-const reference to this tokenizer. If the allowTrailingTopLevelComma value is true this tokenizer will successfully tokenize JSON values where a comma follows the top-level JSON element. If the option's value is false then the tokenizer will reject documents with such trailing commas, such as {},. By default, the value of the allowTrailingTopLevelComma option is true for backwards compatibility. Note that a document without any JSON elements is invalid whether or not it contains commas. The behavior is undefined unless e_RELAXED == conformanceMode().

◆ setAllowUnescapedControlCharacters()

Tokenizer & bdljsn::Tokenizer::setAllowUnescapedControlCharacters ( bool value )

inline

Set the allowUnescapedControlCharacters option of this tokenizer to the specified value. If true, characters in the range [ 0x00 .. 0x1F ] are allowed in JSON strings. If the option is false, these characters must be represented by their six byte escape sequences [ \u0000 .. \u001F ]. Several values in that range are also (conveniently) represented by two byte sequences:

\"    quotation mark
\\    reverse solidus
\/    solidus
\b    backspace
\f    form feed
\n    line feed
\r    carriage return
\t    tab

The DEL control character (0x7F) is accepted even in strict mode.

The behavior is undefined unless e_RELAXED == conformanceMode(). Note that the representation of these byte sequences as C/C++ string literals requires that the escape character itself be escaped:

"Hello,\\tworld\\n";  // Can always initialize a JSON string
                      // containing tab and newline
                      // escape sequences,
                      // whether the option is set or not.

"Hello,\tworld\n";    // When this option is 'true',
                      // can also initialize a JSON string
                      // with actual tab and newline characters.

Also note that the two resulting strings do not compare equal.
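The difference between the two literals can be checked directly in C++. In this sketch, hasUnescapedControlCharacter and the two constants are hypothetical names; the function only scans for raw bytes in the range [ 0x00 .. 0x1F ] and makes no claim about how the tokenizer performs its own check.

```cpp
#include <string_view>

// Hypothetical helper: return 'true' if the specified 'text' holds any raw
// byte in the range '[ 0x00 .. 0x1F ]'.
inline bool hasUnescapedControlCharacter(std::string_view text)
{
    for (unsigned char c : text) {
        if (c <= 0x1F) {
            return true;
        }
    }
    return false;
}

// The two C++ literals discussed above.
constexpr std::string_view k_ESCAPED   = "Hello,\\tworld\\n";
constexpr std::string_view k_UNESCAPED = "Hello,\tworld\n";
```

The first literal holds a backslash followed by a letter for each escape, so it is two bytes longer than the second, which holds the actual tab and newline bytes.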

◆ setConformanceMode()

Tokenizer & bdljsn::Tokenizer::setConformanceMode ( Tokenizer::ConformanceMode mode )

inline

Set the conformanceMode of this tokenizer to the specified mode and return a non-const reference to this tokenizer. If mode is e_STRICT_20240119 the option values of this tokenizer are set to be fully compliant with RFC8259 (see https://www.rfc-editor.org/rfc/rfc8259).

Specifically, those option values are:

allowConsecutiveSeparators()      == false;
allowFormFeedAsWhitespace()       == false;
allowHeterogenousArrays()         == true;
allowNonUtf8StringLiterals()      == false;
allowStandAloneValues()           == true;
allowTrailingTopLevelComma()      == false;
allowUnescapedControlCharacters() == false;

Otherwise (i.e., mode is e_RELAXED), those option values can be set in any combination. Note that the behavior is undefined if individual options are set when conformanceMode is not e_RELAXED.
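For tests or configuration checks it can help to hold the two documented option sets as plain data. The sketch below simply restates the values listed in this page; OptionValues, relaxedDefaults, and strict20240119 are hypothetical names and do not exist in bdljsn.

```cpp
// Hypothetical plain-data mirror of the seven tokenizer options.
struct OptionValues {
    bool allowConsecutiveSeparators;
    bool allowFormFeedAsWhitespace;
    bool allowHeterogenousArrays;
    bool allowNonUtf8StringLiterals;
    bool allowStandAloneValues;
    bool allowTrailingTopLevelComma;
    bool allowUnescapedControlCharacters;
};

// Option values documented for a default-constructed tokenizer.
constexpr OptionValues relaxedDefaults()
{
    return {true, true, true, true, true, true, true};
}

// Option values documented for 'e_STRICT_20240119'.
constexpr OptionValues strict20240119()
{
    return {false, false, true, false, true, false, false};
}
```

Only allowHeterogenousArrays and allowStandAloneValues keep the value true in both sets.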

◆ tokenType()

Tokenizer::TokenType bdljsn::Tokenizer::tokenType ( ) const

inline

Return the token type of the current token.

◆ value()

int bdljsn::Tokenizer::value ( bsl::string_view * data ) const

Load into the specified data the value of the current token if the current token's type is e_ELEMENT_NAME or e_ELEMENT_VALUE or leave data unmodified otherwise. Return 0 on success and a non-zero value otherwise. Note that the returned data is only valid until the next manipulator call on this object.
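Because the loaded view is invalidated by the next manipulator call, text that must outlive the current token has to be copied first. The sketch below shows that pattern as a template over any type exposing advanceToNextToken, tokenType, and value. FakeTokenizer is a hypothetical stand-in that replays a fixed token list so the example compiles without BDE, and using std::string_view in place of bsl::string_view is an assumption.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Drain the specified 'tokenizer', copying each element name or value into
// an owning 'std::string' before advancing again.
template <class TOKENIZER>
std::vector<std::string> collectText(TOKENIZER& tokenizer)
{
    std::vector<std::string> result;
    while (0 == tokenizer.advanceToNextToken()) {
        const auto type = tokenizer.tokenType();
        if (type == TOKENIZER::e_ELEMENT_NAME ||
            type == TOKENIZER::e_ELEMENT_VALUE) {
            std::string_view data;
            if (0 == tokenizer.value(&data)) {
                result.emplace_back(data);  // copy: the view expires soon
            }
        }
    }
    return result;
}

// Hypothetical stand-in used only to exercise the sketch; it replays a
// fixed list of tokens and does not parse JSON.
class FakeTokenizer {
  public:
    enum TokenType { e_BEGIN = 1, e_ELEMENT_NAME, e_START_OBJECT,
                     e_END_OBJECT, e_START_ARRAY, e_END_ARRAY,
                     e_ELEMENT_VALUE, e_ERROR };

    struct Token {
        TokenType   type;
        std::string text;
    };

    explicit FakeTokenizer(std::vector<Token> tokens)
    : d_tokens(std::move(tokens))
    , d_next(0)
    , d_current{e_BEGIN, std::string()}
    {
    }

    int advanceToNextToken()
    {
        if (d_next >= d_tokens.size()) {
            return 1;                       // nothing left to replay
        }
        d_current = d_tokens[d_next++];
        return 0;
    }

    TokenType tokenType() const { return d_current.type; }

    int value(std::string_view *data) const
    {
        if (e_ELEMENT_NAME != d_current.type &&
            e_ELEMENT_VALUE != d_current.type) {
            return -1;                      // leave '*data' unmodified
        }
        *data = d_current.text;
        return 0;
    }

  private:
    std::vector<Token> d_tokens;   // tokens to replay
    std::size_t        d_next;     // index of the next token
    Token              d_current;  // token most recently advanced to
};
```

The same collectText template is intended to accept a real tokenizer after reset has been called, provided its string-view type is compatible with std::string_view.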

The documentation for this class was generated from the following file:

bdljsn_tokenizer.h
