Outline
Purpose
Provide a common reader protocol for parsing XML documents.
Classes
- balxml::Reader: protocol for fast, forward-only access to an XML data stream
See also
balxml_validatingreader, balxml_elementattribute, balxml::ErrorInfo, balxml_prefixstack, balxml_namespaceregistry
Description
This component supplies an abstract class, balxml::Reader, that defines an interface for accessing a forward-only, read-only stream of XML data. The balxml::Reader interface is somewhat similar to the Microsoft XmlReader interface, which provides a simpler and more flexible programming model than the quasi-standard SAX/SAX2 model and a (potentially) more memory-efficient programming model than DOM. Access to the data is done in a cursor-like fashion, going forward on the document stream and stopping at each node along the way. A "node" is an XML syntactic construct such as the start of an element, the end of an element, element text, etc. (See the balxml::Reader::NodeType enumeration for a complete list.) Note that, unlike the Microsoft interface, an element attribute is not considered a node in this interface, but is rather considered an attribute of a start-element node. In the documentation below, the "current node" refers to the node on which the reader is currently positioned. The client code advances through all of the nodes in the XML document by calling the advanceToNextNode function repeatedly and processing each node in the order it appears in the XML document.
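The cursor model can be illustrated with a small stand-in written against the standard library only. This is a sketch, not the balxml::Reader API: MiniNode, MiniCursor, and collectElementNames are hypothetical names, and the return convention (0 on success, 1 at the end of input) is an assumption made for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a parsed node; not part of balxml.
struct MiniNode {
    std::string kind;  // e.g. "element", "text", "end-element"
    std::string name;  // empty for text nodes
};

// Hypothetical forward-only cursor over a pre-built node list.
class MiniCursor {
    const std::vector<MiniNode>& d_nodes;    // nodes in document order
    std::size_t                  d_next;     // index of the next node to visit
    const MiniNode              *d_current;  // null until the first advance

  public:
    explicit MiniCursor(const std::vector<MiniNode>& nodes)
    : d_nodes(nodes), d_next(0), d_current(nullptr) {}

    // Assumed convention for this sketch: 0 on success, 1 at end of input.
    int advanceToNextNode()
    {
        if (d_next >= d_nodes.size()) {
            return 1;
        }
        d_current = &d_nodes[d_next++];
        return 0;
    }

    // Valid only after a successful call to 'advanceToNextNode'.
    const MiniNode& current() const { return *d_current; }
};

// Visit every node exactly once, in document order, and collect the names
// of the start-element nodes.
std::vector<std::string> collectElementNames(const std::vector<MiniNode>& doc)
{
    std::vector<std::string> names;
    MiniCursor               cursor(doc);
    while (0 == cursor.advanceToNextNode()) {
        if (cursor.current().kind == "element") {
            names.push_back(cursor.current().name);
        }
    }
    return names;
}
```

The loop never moves backward and never looks ahead; everything the client needs from a node has to be taken while the cursor is positioned on it.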
balxml::Reader supplies accessors that query a node's attributes, such as the node's type, name, value, element attributes, etc. Note that each call to advanceToNextNode invalidates strings and data structures returned when the balxml::Reader accessors were called for the prior node. E.g., the pointer returned from nodeName for one node will not be valid once the reader has advanced to the next node. The fact that this interface provides so little prior context gives the derived-class implementations the potential to be very efficient in their use of memory.
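A practical consequence is that clients should copy any string they need before advancing. The following sketch shows the pattern with a hypothetical reader (ReusingReader, not part of balxml) that deliberately reuses one internal buffer for every node name, which is one way an implementation might keep memory use low.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical reader that overwrites a single buffer on every advance.
class ReusingReader {
    std::vector<std::string> d_names;   // node names in document order
    std::size_t              d_next;    // index of the next name to load
    std::string              d_buffer;  // storage behind 'nodeName'

  public:
    explicit ReusingReader(const std::vector<std::string>& names)
    : d_names(names), d_next(0) {}

    // Assumed convention for this sketch: 0 on success, 1 at end of input.
    int advanceToNextNode()
    {
        if (d_next >= d_names.size()) {
            return 1;
        }
        d_buffer = d_names[d_next++];  // previous contents are lost here
        return 0;
    }

    const char *nodeName() const { return d_buffer.c_str(); }
};

// Return a copy of the first node's name, then advance past it.
std::string firstNameCopied(ReusingReader& reader)
{
    std::string saved;
    if (0 == reader.advanceToNextNode()) {
        saved = reader.nodeName();  // deep copy while the pointer is valid
    }
    reader.advanceToNextNode();     // earlier 'nodeName' results are now stale
    return saved;
}
```

Holding on to the raw `const char *` instead of the copy would leave the caller with a pointer whose contents describe a different node, or with a dangling pointer if the buffer was reallocated.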
Any derived class must adhere to the class-level and function-level contract documented in this component. Note that an object of a derived class implementation must be reusable such that, after parsing one document, the reader can be closed and re-opened to parse another document.
Node Type
An enumeration value that identifies a node as a specific XML construct, e.g., ELEMENT, END_ELEMENT, TEXT, CDATA, etc. (See the balxml::Reader::NodeType enumeration for a complete list.)
Qualified and Local Names
XML documents may contain some qualified names. These are names with a prefix (optional) and a local name, separated by a colon. (The colon is present only if the prefix is present.) The prefix is a (typically short) word that is associated with a namespace URI via a namespace declaration. The local name specifies an entity within the specified namespace or, if no prefix is given, within the default namespace. For each qualified name, the balxml::Reader interface provides access to the entire qualified name and separate access to the prefix, the local name, the namespace URI, and the namespace ID.
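The split between prefix and local name can be sketched as follows. QNameParts and splitQName are hypothetical helpers written for illustration; they are not part of the balxml interface, which exposes the pieces through separate accessors instead.

```cpp
#include <cassert>
#include <string>

// Hypothetical result type: the two halves of a qualified name.
struct QNameParts {
    std::string prefix;     // empty when the name has no prefix
    std::string localName;  // the part after the colon, or the whole name
};

// Split 'qname' at the first colon, if there is one.
QNameParts splitQName(const std::string& qname)
{
    QNameParts                   parts;
    const std::string::size_type colon = qname.find(':');
    if (std::string::npos == colon) {
        parts.localName = qname;               // no prefix present
    }
    else {
        parts.prefix    = qname.substr(0, colon);
        parts.localName = qname.substr(colon + 1);
    }
    return parts;
}
```

For the attribute `dir:phonetype` used in the usage example later in this document, this yields the prefix `dir` and the local name `phonetype`; mapping the prefix to a namespace URI is a separate lookup against the namespace declarations in scope.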
Base URI
Networked XML documents may comprise chunks of data aggregated using various W3C standard inclusion mechanisms and can contain nodes that come from different places. DTD entities are an example of this. The base URI tells you where a node comes from (see http://www.w3.org/TR/xmlbase/). The base URI of an element is:
- The base URI specified by an xml:base attribute on the element, if one exists, otherwise
- The base URI of the element's parent element within the document or external entity, if one exists, otherwise
- The base URI of the document entity or external entity containing the element.
If there is no base URI for a node being returned (for example, it was parsed from an in-memory string), then nodeBaseUri returns an empty string.
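The three-step rule above amounts to a walk up the element's ancestors. The sketch below uses a hypothetical ElementInfo record and effectiveBaseUri function (neither is part of balxml) and makes one simplifying assumption: every xml:base value is already an absolute URI, so no relative-reference resolution is performed.

```cpp
#include <cassert>
#include <string>

// Hypothetical record describing one element for the purpose of this sketch.
struct ElementInfo {
    std::string        xmlBase;  // value of 'xml:base', or empty if absent
    const ElementInfo *parent;   // enclosing element, or null for the root
};

// Return the base URI that applies to 'element': the nearest 'xml:base' on
// the element or one of its ancestors, otherwise 'entityBaseUri' (the base
// URI of the containing document or external entity, possibly empty).
std::string effectiveBaseUri(const ElementInfo  *element,
                             const std::string&  entityBaseUri)
{
    for (const ElementInfo *e = element; e; e = e->parent) {
        if (!e->xmlBase.empty()) {
            return e->xmlBase;
        }
    }
    return entityBaseUri;
}
```

When the document was parsed from an in-memory string, `entityBaseUri` is empty in this sketch, so an element with no xml:base anywhere in its ancestry ends up with an empty base URI.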
Encoding
An XML document or any external reference (such as expanding an entity in a DTD file or reading a schema file) will be encoded, for example, in "ASCII", "UTF-8", or "UTF-16". The document can also contain self-describing information as to which encoding was used when the document was created. Note that the encoding returned from the documentEncoding method can differ from the encoding of the strings returned from the balxml::Reader accessors; all strings returned by these accessors are UTF-8 regardless of the encoding used in the original document.
If encoding information is not provided in the document, the balxml::Reader::open method allows clients to specify an encoding to use. The encoding passed to balxml::Reader::open will take effect only when there is no encoding information in the original document, i.e., the encoding information obtained from the original document always takes precedence. If there is no encoding provided within the document and the client has not provided one via the balxml::Reader::open method, then a derived-class implementation should set the encoding to UTF-8. (See the balxml::Reader::open method for more details.)
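The precedence described here (document declaration first, then the value passed to open, then UTF-8) can be written as a small pure function. effectiveEncoding is a hypothetical helper for illustration only; a real implementation would obtain the first argument by inspecting the document itself.

```cpp
#include <cassert>
#include <string>

// Choose the encoding to report, following the precedence described above.
// A null or empty string means "not specified".
std::string effectiveEncoding(const char *declaredInDocument,
                              const char *passedToOpen)
{
    if (declaredInDocument && '\0' != declaredInDocument[0]) {
        return declaredInDocument;  // the document's own declaration wins
    }
    if (passedToOpen && '\0' != passedToOpen[0]) {
        return passedToOpen;        // client-supplied fallback
    }
    return "UTF-8";                 // default when nothing is specified
}
```

Note that this only decides which encoding name is reported for the document; the strings handed back by the accessors are UTF-8 in every case.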
Thread Safety
This component does not provide any functions that present a thread safety issue, since the balxml::Reader class is abstract and cannot be instantiated. There is no guarantee that any specific derived class will provide a thread-safe implementation.
Usage
This section illustrates intended use of this component.
Example 1: The protocol usage
The following string describes XML for a very simple user directory. The top-level element contains one XML namespace attribute, with one embedded entry describing a user.
const char TEST_XML_STRING[] =
"<?xml version='1.0' encoding='UTF-8'?>\n"
"<directory-entry xmlns:dir='http://bloomberg.com/schemas/directory'>\n"
" <name>John Smith</name>\n"
" <phone dir:phonetype='cell'>212-318-2000</phone>\n"
" <address/>\n"
"</directory-entry>\n";
Suppose we need to extract the user's name and cellphone number from this entry. In order to read the XML, we first need to construct a balxml::NamespaceRegistry object, a balxml::PrefixStack object, and a TestReader object, where TestReader is an implementation of balxml::Reader.
balxml::NamespaceRegistry namespaces;
balxml::PrefixStack prefixStack(&namespaces);
TestReader testReader;
balxml::Reader& reader = testReader;
The reader uses a balxml::PrefixStack to manage namespace prefixes. Installing a stack for an open reader leads to undefined behavior. So, we want to ensure that our reader is not open before installation.
assert(false == reader.isOpen());
reader.setPrefixStack(&prefixStack);
assert(reader.prefixStack() == &prefixStack);
Next, we call the open method to set up the reader for parsing using the data contained in the XML string.
reader.open(TEST_XML_STRING, sizeof(TEST_XML_STRING) - 1, 0, "UTF-8");
Confirm that the balxml::Reader has opened properly.
assert(true == reader.isOpen());
Then, iterate through the nodes to find the elements that are interesting to us. First, we'll find the user's name:
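int rc = 0;
bsl::string name;
bsl::string number;

// Advance until the reader is positioned on the 'name' start element.
do {
    rc = reader.advanceToNextNode();
    assert(0 == rc);
} while (bsl::strcmp(reader.nodeName(), "name"));

// The next node is the text node holding the name itself.
rc = reader.advanceToNextNode();
assert(0 == rc);
assert(reader.nodeType() == balxml::Reader::e_NODE_TYPE_TEXT);
assert(reader.nodeHasValue());
name.assign(reader.nodeValue());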
Next, advance to the user's phone number:
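// Advance until the reader is positioned on the 'phone' start element.
do {
    rc = reader.advanceToNextNode();
    assert(0 == rc);
} while (bsl::strcmp(reader.nodeName(), "phone"));

// Inspect the element's first attribute to check the phone type.
balxml::ElementAttribute elemAttr;
rc = reader.lookupAttribute(&elemAttr, 0);
assert(0 == rc);
assert(false == elemAttr.isNull());

if (!bsl::strcmp(elemAttr.value(), "cell")) {
    // The next node is the text node holding the number.
    rc = reader.advanceToNextNode();
    assert(0 == rc);
    assert(reader.nodeHasValue());
    number.assign(reader.nodeValue());
}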
Now, verify the extracted data:
assert("John Smith" == name);
assert("212-318-2000" == number);
Finally, close the reader:
reader.close();
assert(false == reader.isOpen());
Example 2: The protocol implementation
We have to implement all pure virtual functions of the balxml::Reader protocol, but to keep the example short and readable we will stub some methods. Moreover, we will provide fake implementations of the methods used in this example, so our implementation will not parse the given XML fragment, but will instead iterate through a hard-coded, fictitious node structure.
First, let's introduce an array of "helper" structs. This array will be filled in with data capable of describing the information contained in the user directory XML above:
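struct TestNode {
    struct Attribute {
        const char *d_qname;
        const char *d_value;
    };
    enum {
        k_NUM_ATTRIBUTES = 5
    };
    balxml::Reader::NodeType d_type;         // type of this node
    const char *d_qname;                     // qualified name, or 0
    const char *d_nodeValue;                 // node value, or 0
    int d_depthChange;                       // applied to the depth on arrival
    bool d_isEmpty;                          // true for '<elem/>'
    Attribute d_attributes[k_NUM_ATTRIBUTES];
};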
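static const TestNode fakeDocument[] = {
    { balxml::Reader::e_NODE_TYPE_NONE,
      0                , 0                               ,  0,
      false, {} },
    { balxml::Reader::e_NODE_TYPE_XML_DECLARATION,
      "xml"            , "version='1.0' encoding='UTF-8'", +1,
      false, {} },
    { balxml::Reader::e_NODE_TYPE_ELEMENT,
      "directory-entry", 0                               ,  0,
      false, {"xmlns:dir", "http://bloomberg.com/schemas/directory"} },
    { balxml::Reader::e_NODE_TYPE_ELEMENT,
      "name"           , 0                               , +1,
      false, {} },
    { balxml::Reader::e_NODE_TYPE_TEXT,
      0                , "John Smith"                    , +1,
      false, {} },
    { balxml::Reader::e_NODE_TYPE_END_ELEMENT,
      "name"           , 0                               , -1,
      false, {} },
    { balxml::Reader::e_NODE_TYPE_ELEMENT,
      "phone"          , 0                               ,  0,
      false, {"dir:phonetype", "cell"} },
    { balxml::Reader::e_NODE_TYPE_TEXT,
      0                , "212-318-2000"                  , +1,
      false, {} },
    { balxml::Reader::e_NODE_TYPE_END_ELEMENT,
      "phone"          , 0                               , -1,
      false, {} },
    { balxml::Reader::e_NODE_TYPE_ELEMENT,
      "address"        , 0                               ,  0,
      true , {} },
    { balxml::Reader::e_NODE_TYPE_END_ELEMENT,
      "directory-entry", 0                               , -1,
      false, {} },
    { balxml::Reader::e_NODE_TYPE_NONE,
      0                , 0                               ,  0,
      false, {} },
};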
Now, create a class that implements the balxml::Reader interface. Note that documentation for class methods is omitted to keep the usage example short. If necessary, it can be found in the balxml::Reader class declaration.
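class TestReader : public balxml::Reader {
  private:
    balxml::ErrorInfo    d_errorInfo;
    balxml::PrefixStack *d_prefixes;     // held, not owned
    XmlResolverFunctor   d_resolver;
    bool                 d_isOpen;
    bsl::string          d_encoding;
    int                  d_nodeDepth;
    const TestNode      *d_currentNode;

    void setEncoding(const char *encoding);
    void adjustPrefixStack();

  public:
    TestReader();
    virtual ~TestReader();

    virtual void setResolver(XmlResolverFunctor resolver);
    virtual void setPrefixStack(balxml::PrefixStack *prefixes);
    virtual int open(const char *filename, const char *encoding = 0);
    virtual int open(const char *buffer,
                     size_t      size,
                     const char *url = 0,
                     const char *encoding = 0);
    virtual int open(bsl::streambuf *stream,
                     const char     *url = 0,
                     const char     *encoding = 0);
    virtual void close();
    virtual int advanceToNextNode();
    virtual int lookupAttribute(balxml::ElementAttribute *attribute,
                                int                       index) const;
    virtual int lookupAttribute(balxml::ElementAttribute *attribute,
                                const char               *qname) const;
    virtual int lookupAttribute(balxml::ElementAttribute *attribute,
                                const char               *localName,
                                const char               *namespaceUri) const;
    virtual int lookupAttribute(balxml::ElementAttribute *attribute,
                                const char               *localName,
                                int                       namespaceId) const;
    virtual void setOptions(unsigned int flags);

    virtual const char *documentEncoding() const;
    virtual XmlResolverFunctor resolver() const;
    virtual bool isOpen() const;
    virtual const balxml::ErrorInfo& errorInfo() const;
    virtual int getLineNumber() const;
    virtual int getColumnNumber() const;
    virtual balxml::PrefixStack *prefixStack() const;
    virtual NodeType nodeType() const;
    virtual const char *nodeName() const;
    virtual const char *nodeLocalName() const;
    virtual const char *nodePrefix() const;
    virtual int nodeNamespaceId() const;
    virtual const char *nodeNamespaceUri() const;
    virtual const char *nodeBaseUri() const;
    virtual bool nodeHasValue() const;
    virtual const char *nodeValue() const;
    virtual int nodeDepth() const;
    virtual int numAttributes() const;
    virtual bool isEmptyElement() const;
    virtual unsigned int options() const;
};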
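inline
void TestReader::setEncoding(const char *encoding)
{
    // Fall back to UTF-8 when no encoding is supplied.
    d_encoding =
        (0 == encoding || '\0' == encoding[0]) ? "UTF-8" : encoding;
}
inline
void TestReader::adjustPrefixStack()
{
    // On an element node, push every namespace declaration found among its
    // attributes; on an 'e_NODE_TYPE_NONE' node, reset the stack.
    if (balxml::Reader::e_NODE_TYPE_ELEMENT == d_currentNode->d_type) {
        for (int ii = 0; ii < TestNode::k_NUM_ATTRIBUTES; ++ii) {
            const char *prefix = d_currentNode->d_attributes[ii].d_qname;
            if (!prefix || bsl::strncmp("xmlns", prefix, 5)) {
                continue;
            }
            if (':' == prefix[5]) {
                d_prefixes->pushPrefix(
                    prefix + 6, d_currentNode->d_attributes[ii].d_value);
            }
            else {
                // default namespace
                d_prefixes->pushPrefix(
                    "", d_currentNode->d_attributes[ii].d_value);
            }
        }
    }
    else if (balxml::Reader::e_NODE_TYPE_NONE == d_currentNode->d_type) {
        d_prefixes->reset();
    }
}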
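The attribute test performed in adjustPrefixStack can be looked at in isolation. The following standalone sketch uses hypothetical names (NamespaceDecl, classifyAttribute) and the standard library only. It is slightly stricter than the code above, as an assumption of this sketch: a name such as `xmlnsfoo` is not treated as a default-namespace declaration.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Hypothetical result type for this sketch.
struct NamespaceDecl {
    bool        isDeclaration;  // false if this is an ordinary attribute
    std::string prefix;         // declared prefix; empty for the default
};

// Decide whether the attribute named 'qname' is a namespace declaration
// ('xmlns' or 'xmlns:<prefix>') and, if so, which prefix it declares.
NamespaceDecl classifyAttribute(const char *qname)
{
    NamespaceDecl result = { false, "" };
    if (!qname || 0 != std::strncmp("xmlns", qname, 5)) {
        return result;                   // not a namespace declaration
    }
    if (':' == qname[5]) {
        result.isDeclaration = true;
        result.prefix        = qname + 6;  // text after "xmlns:"
    }
    else if ('\0' == qname[5]) {
        result.isDeclaration = true;     // bare "xmlns": default namespace
    }
    return result;
}
```

For the fake document in this example, `xmlns:dir` is classified as a declaration of the prefix `dir`, while `dir:phonetype` is an ordinary attribute that merely uses that prefix.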
TestReader::TestReader()
: d_errorInfo()
, d_prefixes(0)
, d_resolver()
, d_isOpen(false)
, d_encoding()
, d_nodeDepth(0)
, d_currentNode(0)
{
}
TestReader::~TestReader()
{
}
void TestReader::setResolver(XmlResolverFunctor resolver)
{
    d_resolver = resolver;
}
void TestReader::setPrefixStack(balxml::PrefixStack *prefixes)
{
    assert(!d_isOpen);
    d_prefixes = prefixes;
}
int TestReader::open(const char * ,
const char * )
{
return -1;
}
int TestReader::open(const char *,
                     size_t      ,
                     const char *,
                     const char *encoding)
{
    if (d_isOpen) {
        return -1;  // the reader is already open
    }
    d_isOpen    = true;
    d_nodeDepth = 0;
Note that we do not use the supplied buffer, but direct the internal iterator to the fake structure:
    d_currentNode = fakeDocument;
    setEncoding(encoding);
    return 0;
}
int TestReader::open(bsl::streambuf * ,
const char * ,
const char * )
{
return -1;
}
void TestReader::close()
{
if (d_prefixes) {
d_prefixes->reset();
}
d_isOpen = false;
d_encoding.clear();
d_nodeDepth = 0;
d_currentNode = 0;
}
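int TestReader::advanceToNextNode()
{
    if (!d_currentNode) {
        return -1;
    }
    const TestNode *nextNode = d_currentNode + 1;
    if (balxml::Reader::e_NODE_TYPE_NONE == nextNode->d_type) {
        // A node of type 'e_NODE_TYPE_NONE' marks the end of the document.
        d_prefixes->reset();
        return 1;
    }
    d_currentNode = nextNode;
    if (d_prefixes && 1 == d_nodeDepth) {
        // Namespace declarations are recognized on the top-level element
        // only.
        adjustPrefixStack();
    }
    d_nodeDepth += d_currentNode->d_depthChange;
    return 0;
}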
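int TestReader::lookupAttribute(balxml::ElementAttribute *attribute,
                                int                       index) const
{
    if (!d_currentNode ||
        index < 0 ||
        index >= TestNode::k_NUM_ATTRIBUTES) {
        return 1;
    }
    const char *qname = d_currentNode->d_attributes[index].d_qname;
    if (!qname || '\0' == qname[0]) {
        // Unused attribute slots hold a null name.
        return 1;
    }
    attribute->reset(
        d_prefixes, qname, d_currentNode->d_attributes[index].d_value);
    return 0;
}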
int TestReader::lookupAttribute(balxml::ElementAttribute *,
                                const char               *) const
{
    return -1;
}
int TestReader::lookupAttribute(balxml::ElementAttribute *,
                                const char               *,
                                const char               *) const
{
    return -1;
}
int TestReader::lookupAttribute(balxml::ElementAttribute *,
                                const char               *,
                                int                       ) const
{
    return -1;
}
void TestReader::setOptions(unsigned int )
{
return;
}
const char *TestReader::documentEncoding() const
{
return d_encoding.c_str();
}
TestReader::XmlResolverFunctor TestReader::resolver() const
{
return d_resolver;
}
bool TestReader::isOpen() const
{
return d_isOpen;
}
const balxml::ErrorInfo& TestReader::errorInfo() const
{
    return d_errorInfo;
}
int TestReader::getLineNumber() const
{
return 0;
}
int TestReader::getColumnNumber() const
{
return 0;
}
balxml::PrefixStack *TestReader::prefixStack() const
{
    return d_prefixes;
}
TestReader::NodeType TestReader::nodeType() const
{
if (!d_currentNode || !d_isOpen) {
return e_NODE_TYPE_NONE;
}
return d_currentNode->d_type;
}
const char *TestReader::nodeName() const
{
if (!d_currentNode || !d_isOpen) {
return 0;
}
return d_currentNode->d_qname;
}
const char *TestReader::nodeLocalName() const
{
if (!d_currentNode || !d_isOpen) {
return 0;
}
return d_currentNode->d_qname;
}
const char *TestReader::nodePrefix() const
{
return "";
}
int TestReader::nodeNamespaceId() const
{
return -1;
}
const char *TestReader::nodeNamespaceUri() const
{
return "";
}
const char *TestReader::nodeBaseUri() const
{
return "";
}
bool TestReader::nodeHasValue() const
{
if (!d_currentNode || !d_isOpen) {
return false;
}
if (0 == d_currentNode->d_nodeValue) {
return false;
}
return ('\0' != d_currentNode->d_nodeValue[0]);
}
const char *TestReader::nodeValue() const
{
if (!d_currentNode || !d_isOpen) {
return 0;
}
return d_currentNode->d_nodeValue;
}
int TestReader::nodeDepth() const
{
return d_nodeDepth;
}
int TestReader::numAttributes() const
{
for (int index = 0; index < TestNode::k_NUM_ATTRIBUTES; ++index) {
if (0 == d_currentNode->d_attributes[index].d_qname) {
return index;
}
}
return TestNode::k_NUM_ATTRIBUTES;
}
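numAttributes relies on a sentinel convention: the attribute array has a fixed capacity, and the first slot with a null name marks the end of the attributes in use. The same idea is shown below in a standalone form, using hypothetical names (Attr, k_MAX_ATTRS, countAttributes) and no balxml types.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical attribute slot; a null 'qname' marks an unused slot.
struct Attr {
    const char *qname;
    const char *value;
};

const std::size_t k_MAX_ATTRS = 5;  // fixed capacity, as in 'TestNode'

// Count the slots in use: everything before the first null 'qname', or the
// full capacity if no slot is unused.
std::size_t countAttributes(const Attr (&attrs)[k_MAX_ATTRS])
{
    for (std::size_t i = 0; i < k_MAX_ATTRS; ++i) {
        if (0 == attrs[i].qname) {
            return i;
        }
    }
    return k_MAX_ATTRS;
}
```

This works because aggregate initialization zero-fills every slot that is not listed explicitly, so an initializer such as `{"dir:phonetype", "cell"}` leaves the remaining slots with null names. Any code that indexes into the array should test for that null pointer before dereferencing the name.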
bool TestReader::isEmptyElement() const
{
return d_currentNode->d_isEmpty;
}
unsigned int TestReader::options() const
{
return 0;
}
Finally, our implementation of balxml::Reader is complete. We may use this implementation as the TestReader in the first example.