Component balxml_reader
[Package balxml]

Provide common reader protocol for parsing XML documents.

Namespaces

namespace  balxml

Detailed Description

Outline
Purpose:
Provide common reader protocol for parsing XML documents.
Classes:
balxml::Reader protocol for fast, forward-only access to XML data stream
See also:
Component balxml_validatingreader, Component balxml_elementattribute, balxml::ErrorInfo, Component balxml_prefixstack, Component balxml_namespaceregistry
Description:
This component supplies an abstract class, balxml::Reader, that defines an interface for accessing a forward-only, read-only stream of XML data. The balxml::Reader interface is somewhat similar to the Microsoft XmlReader interface, which provides a simpler and more flexible programming model than the quasi-standard SAX/SAX2 model and a (potentially) more memory-efficient programming model than DOM. Access to the data is done in a cursor-like fashion, going forward on the document stream and stopping at each node along the way. A "node" is an XML syntactic construct such as the start of an element, the end of an element, element text, etc. (See the balxml::Reader::NodeType enumeration for a complete list.) Note that, unlike the Microsoft interface, an element attribute is not considered a node in this interface, but is rather considered an attribute of a start-element node. In the documentation below, the "current node" refers to the node on which the reader is currently positioned. The client code advances through all of the nodes in the XML document by calling the advanceToNextNode function repeatedly and processing each node in the order it appears in the XML document.
balxml::Reader supplies accessors that query a node's attributes, such as the node's type, name, value, element attributes, etc. Note that each call to advanceToNextNode invalidates strings and data structures returned when the balxml::Reader accessors were called for the prior node. E.g., the pointer returned from nodeName for one node will not be valid once the reader has advanced to the next node. The fact that this interface provides so little prior context gives the derived-class implementations the potential to be very efficient in their use of memory.
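The invalidation rule above means that a client must copy any value it wants to keep before advancing. The following is a minimal sketch of that pattern. Note that MiniCursor is a hypothetical stand-in written only for this illustration and is not part of balxml; it assumes an implementation that reuses a single internal buffer for every node:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a reader; NOT part of balxml.  It reuses one
// internal buffer for the current node's name, the way a memory-efficient
// implementation might.
class MiniCursor {
    std::vector<std::string> d_names;    // node names, in document order
    std::size_t              d_next;     // index of the next node to visit
    std::string              d_current;  // buffer reused for every node

  public:
    explicit MiniCursor(const std::vector<std::string>& names)
    : d_names(names), d_next(0), d_current() {}

    // Move to the next node.  Return 0 on success and 1 at end of input.
    int advanceToNextNode()
    {
        if (d_next >= d_names.size()) {
            return 1;
        }
        d_current = d_names[d_next++];  // overwrites the previous name
        return 0;
    }

    // The returned pointer is valid only until the next call to
    // 'advanceToNextNode'.
    const char *nodeName() const { return d_current.c_str(); }
};

// Collect every node name, copying each one before the cursor moves on.
std::vector<std::string> collectNames(MiniCursor& cursor)
{
    std::vector<std::string> result;
    while (0 == cursor.advanceToNextNode()) {
        result.push_back(cursor.nodeName());  // copy into an owned string
    }
    return result;
}
```

Storing the raw const char * values instead of copies would leave every saved pointer referring to the same reused buffer.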
Any derived class must adhere to the class-level and function-level contract documented in this component. Note that an object of a derived class implementation must be reusable such that, after parsing one document, the reader can be closed and re-opened to parse another document.
Node Type:
An enumeration value that identifies a node as a specific XML construct, e.g., ELEMENT, END_ELEMENT, TEXT, CDATA, etc. (See the balxml::Reader::NodeType enumeration for a complete list.)
Qualified and local names:
XML documents may contain some qualified names. These are names with a prefix (optional) and a local name, separated by a colon. (The colon is present only if the prefix is present.) The prefix is a (typically short) word that is associated with a namespace URI via a namespace declaration. The local name specifies an entity within the specified namespace or, if no prefix is given, within the default namespace. For each qualified name, the balxml::Reader interface provides access to the entire qualified name and separate access to the prefix, the local name, the namespace URI, and the namespace ID.
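The split between prefix and local name can be sketched as follows. Note that QNameParts and splitQName are hypothetical names introduced only for this illustration; they are not part of balxml, and the sketch does not resolve the prefix to a namespace URI:

```cpp
#include <cassert>
#include <string>

// Hypothetical helper type; NOT part of balxml.
struct QNameParts {
    std::string prefix;     // empty when the name has no prefix
    std::string localName;  // the part after the colon, or the whole name
};

// Split the specified 'qname' at its first colon, if any.
QNameParts splitQName(const std::string& qname)
{
    QNameParts parts;
    const std::string::size_type colon = qname.find(':');
    if (std::string::npos == colon) {
        parts.localName = qname;                    // no prefix present
    }
    else {
        parts.prefix    = qname.substr(0, colon);   // text before the colon
        parts.localName = qname.substr(colon + 1);  // text after the colon
    }
    return parts;
}
```

For instance, "dir:phonetype" yields the prefix "dir" and the local name "phonetype", whereas "name" yields an empty prefix and the local name "name".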
Base URI:
Networked XML documents may comprise chunks of data aggregated using various W3C standard inclusion mechanisms and can contain nodes that come from different places. DTD entities are an example of this. The base URI tells you where a node comes from (see http://www.w3.org/TR/xmlbase/). The base URI of an element is:
  1. The base URI specified by an xml:base attribute on the element, if one exists, otherwise
  2. The base URI of the element's parent element within the document or external entity, if one exists, otherwise
  3. The base URI of the document entity or external entity containing the element.
If there is no base URI for a node being returned (for example, it was parsed from an in-memory string), then nodeBaseUri returns an empty string.
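The three lookup rules above can be sketched as a walk from an element toward the top of the document. Note that ElementInfo and baseUriOf are hypothetical names introduced only for this illustration and are not part of balxml. The sketch also makes two simplifying assumptions: every xml:base value is already an absolute URI (no relative-reference resolution is attempted), and all of the elements belong to a single entity whose URI is passed as entityUri:

```cpp
#include <cassert>
#include <string>

// Hypothetical record describing one element; NOT part of balxml.
struct ElementInfo {
    const ElementInfo *parent;   // enclosing element, or 0 at the top level
    std::string        xmlBase;  // value of 'xml:base', empty if absent
};

// Return the base URI of the specified 'element', falling back to the
// specified 'entityUri' of the containing document or external entity.
std::string baseUriOf(const ElementInfo  *element,
                      const std::string&  entityUri)
{
    // Rules 1 and 2: the nearest 'xml:base' on the element or an ancestor.
    for (const ElementInfo *e = element; e; e = e->parent) {
        if (!e->xmlBase.empty()) {
            return e->xmlBase;
        }
    }
    // Rule 3: the URI of the containing entity (empty for in-memory input).
    return entityUri;
}
```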
Encoding:
An XML document or any external reference (such as expanding an entity in a DTD file or reading a schema file) will be encoded, for example, in "ASCII", "UTF-8", or "UTF-16". The document can also contain self-describing information as to which encoding was used when the document was created. Note that the encoding returned from the documentEncoding method can differ from the encoding of the strings returned from the balxml::Reader accessors; all strings returned by these accessors are UTF-8 regardless of the encoding used in the original document.
If encoding information is not provided in the document, the balxml::Reader::open method allows clients to specify an encoding to use. The encoding passed to balxml::Reader::open will take effect only when there is no encoding information in the original document, i.e., the encoding information obtained from the original document trumps all. If there is no encoding provided within the document and the client has not provided one via the balxml::Reader::open method, then a derived-class implementation should set the encoding to UTF-8. (See the balxml::Reader::open method for more details.)
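The precedence described above (document first, then the open argument, then UTF-8) can be sketched as a small selection function. Note that effectiveEncoding and hasText are hypothetical names introduced only for this illustration and are not part of balxml; the sketch treats a null or empty string as "not provided":

```cpp
#include <cassert>
#include <string>

// Return 'true' if the specified 's' is a non-null, non-empty string.
// Hypothetical helper; NOT part of balxml.
static bool hasText(const char *s)
{
    return s && '\0' != s[0];
}

// Choose the encoding to use, given the encoding declared in the document
// ('fromDocument') and the one passed to 'open' ('fromOpen').
std::string effectiveEncoding(const char *fromDocument, const char *fromOpen)
{
    if (hasText(fromDocument)) {
        return fromDocument;  // the document's own declaration wins
    }
    if (hasText(fromOpen)) {
        return fromOpen;      // otherwise use the client's hint
    }
    return "UTF-8";           // otherwise fall back to the default
}
```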
Thread Safety:
This component does not provide any functions that present a thread safety issue, since the balxml::Reader class is abstract and cannot be instantiated. There is no guarantee that any specific derived class will provide a thread-safe implementation.
Usage:
This section illustrates intended use of this component.
Example 1: The protocol usage:
The following string describes XML for a very simple user directory. The top-level element contains one XML namespace attribute, with one embedded entry describing a user.
  const char TEST_XML_STRING[] =
     "<?xml version='1.0' encoding='UTF-8'?>\n"
     "<directory-entry xmlns:dir='http://bloomberg.com/schemas/directory'>\n"
     "    <name>John Smith</name>\n"
     "    <phone dir:phonetype='cell'>212-318-2000</phone>\n"
     "    <address/>\n"
     "</directory-entry>\n";
Suppose we need to extract the name of the user and his cellphone number from this entry. In order to read the XML, we first need to construct a balxml::NamespaceRegistry object, a balxml::PrefixStack object, and a TestReader object, where TestReader is an implementation of balxml::Reader.
  balxml::NamespaceRegistry namespaces;
  balxml::PrefixStack       prefixStack(&namespaces);
  TestReader                testReader;
  balxml::Reader&           reader = testReader;
The reader uses a balxml::PrefixStack to manage namespace prefixes. Installing a stack for an open reader leads to undefined behavior. So, we want to ensure that our reader is not open before installation.
  assert(false == reader.isOpen());

  reader.setPrefixStack(&prefixStack);

  assert(&prefixStack == reader.prefixStack());
Next, we call the open method to set up the reader for parsing using the data contained in the XML string.
  reader.open(TEST_XML_STRING, sizeof(TEST_XML_STRING) - 1, 0, "UTF-8");
Confirm that the balxml::Reader has opened properly.
  assert(true == reader.isOpen());
Then, iterate through the nodes to find the elements that are interesting to us. First, we'll find the user's name:
  int         rc = 0;
  bsl::string name;
  bsl::string number;

  do {
      rc = reader.advanceToNextNode();
      assert(0 == rc);
  } while (bsl::strcmp(reader.nodeName(), "name"));

  rc = reader.advanceToNextNode();

  assert(0                                == rc);
  assert(3                                == reader.nodeDepth());
  assert(balxml::Reader::e_NODE_TYPE_TEXT == reader.nodeType());
  assert(true                             == reader.nodeHasValue());

  name.assign(reader.nodeValue());
Next, advance to the user's phone number:
  do {
      rc = reader.advanceToNextNode();
      assert(0 == rc);
  } while (bsl::strcmp(reader.nodeName(), "phone"));

  assert(false == reader.isEmptyElement());
  assert(1     == reader.numAttributes());

  balxml::ElementAttribute elemAttr;

  rc = reader.lookupAttribute(&elemAttr, 0);
  assert(0     == rc);
  assert(false == elemAttr.isNull());

  if (!bsl::strcmp(elemAttr.value(), "cell")) {
      rc = reader.advanceToNextNode();

      assert(0                                == rc);
      assert(balxml::Reader::e_NODE_TYPE_TEXT == reader.nodeType());
      assert(true                             == reader.nodeHasValue());

      number.assign(reader.nodeValue());
  }
Now, verify the extracted data:
  assert("John Smith"   == name);
  assert("212-318-2000" == number);
Finally, close the reader:
  reader.close();
  assert(false == reader.isOpen());
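The two search loops in this example assume that the wanted element is present and that nodeName never returns a null pointer for the nodes visited along the way. A more defensive variant might look like the following sketch. Note that advanceToNamedNode is a hypothetical helper, not part of balxml, and that FakeReader exists only to exercise it; the helper is a template, so it relies solely on the advanceToNextNode and nodeName methods described above:

```cpp
#include <cassert>
#include <cstring>

// Advance the specified 'reader' until its current node is named 'name'.
// Return 0 on a match, and otherwise the first non-zero value returned by
// 'advanceToNextNode' (e.g., 1 at end of document).  Hypothetical helper;
// NOT part of balxml.
template <class READER>
int advanceToNamedNode(READER& reader, const char *name)
{
    for (;;) {
        const int rc = reader.advanceToNextNode();
        if (0 != rc) {
            return rc;  // end of document or error: stop searching
        }
        const char *current = reader.nodeName();
        if (current && 0 == std::strcmp(current, name)) {
            return 0;   // found; a null name (e.g., text node) is skipped
        }
    }
}

// Hypothetical stand-in that walks a fixed array of node names.
struct FakeReader {
    const char *const *d_names;  // node names (null for unnamed nodes)
    int                d_count;  // number of entries in 'd_names'
    int                d_index;  // current position, -1 before first node

    int advanceToNextNode()
    {
        if (d_index + 1 >= d_count) {
            return 1;
        }
        ++d_index;
        return 0;
    }

    const char *nodeName() const
    {
        return d_index < 0 ? 0 : d_names[d_index];
    }
};
```

Like the loops above, this matches on the node name alone, so it stops at an end-element node as readily as at a start-element node; a caller that cares about the difference should also check nodeType.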
Example 2: The protocol implementation:
We have to implement all pure virtual functions of the balxml::Reader protocol, but to keep the example short and readable we will stub some methods. Moreover, the methods used in this example have fake implementations: rather than parsing the supplied XML text, our implementation iterates through a hard-coded, hypothetical node structure.
First, let's introduce an array of "helper" structs. This array will be filled in with data capable of describing the information contained in the user directory XML above:
  struct TestNode {
      // A struct that contains information capable of describing an XML
      // node.

      // TYPES
      struct Attribute {
          // This struct represents the qualified name and value of an XML
          // attribute.

          const char *d_qname;  // qualified name of the attribute
          const char *d_value;  // value of the attribute
      };

      enum {
          k_NUM_ATTRIBUTES = 5
      };

      // DATA
      balxml::Reader::NodeType  d_type;
          // type of the node

      const char               *d_qname;
          // qualified name of the node

      const char               *d_nodeValue;
          // value of the XML node (if it's null, 'hasValue()' returns
          // 'false')

      int                       d_depthChange;
          // adjustment for the depth level of 'TestReader', valid values are
          // -1, 0 or 1

      bool                      d_isEmpty;
          // flag indicating whether the element is empty

      Attribute d_attributes[k_NUM_ATTRIBUTES];
          // array of attributes
  };


  static const TestNode fakeDocument[] = {
      // 'fakeDocument' is an array of 'TestNode' objects, that will be used
      // by the 'TestReader' to traverse and describe the user directory XML
      // above.

      { balxml::Reader::e_NODE_TYPE_NONE,
        0                , 0                               ,  0,
        false, {}                                                          },

      { balxml::Reader::e_NODE_TYPE_XML_DECLARATION,
        "xml"            , "version='1.0' encoding='UTF-8'", +1,
        false, {}                                                          },

      { balxml::Reader::e_NODE_TYPE_ELEMENT,
        "directory-entry" , 0                              ,  0,
        false, {"xmlns:dir"    , "http://bloomberg.com/schemas/directory"} },

      { balxml::Reader::e_NODE_TYPE_ELEMENT,
        "name"           , 0                               , +1,
        false, {}                                                          },

      { balxml::Reader::e_NODE_TYPE_TEXT,
        0                , "John Smith"                    , +1,
        false, {}                                                          },

      { balxml::Reader::e_NODE_TYPE_END_ELEMENT,
        "name"           , 0                               , -1,
        false, {}                                                          },

      { balxml::Reader::e_NODE_TYPE_ELEMENT,
        "phone"          , 0                               ,  0,
        false, {"dir:phonetype", "cell"}                                   },

      { balxml::Reader::e_NODE_TYPE_TEXT,
        0                , "212-318-2000"                  , +1,
        false, {}                                                          },

      { balxml::Reader::e_NODE_TYPE_END_ELEMENT,
        "phone"          , 0                               , -1,
        false, {}                                                          },

      { balxml::Reader::e_NODE_TYPE_ELEMENT,
        "address"       , 0                                ,  0,
        true,  {}                                                          },

      { balxml::Reader::e_NODE_TYPE_END_ELEMENT,
        "directory-entry", 0                               , -1,
        false, {}                                                          },

      { balxml::Reader::e_NODE_TYPE_NONE,
        0                , 0                               ,  0,
        false, {}                                                          },
  };
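To see how the d_depthChange values above translate into node depths, the running depth can be accumulated as in the following sketch. Note that runningDepths and k_DEPTH_CHANGES are hypothetical names introduced only for this illustration; the values are copied from fakeDocument, with the two e_NODE_TYPE_NONE sentinels left out because the reader never reports a depth for them:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Depth adjustments of the ten nodes between the sentinels, in order:
// declaration, <directory-entry>, <name>, text, </name>, <phone>, text,
// </phone>, <address/>, </directory-entry>.
static const int k_DEPTH_CHANGES[] = { +1, 0, +1, +1, -1, 0, +1, -1, 0, -1 };

// Return the depth in effect after visiting each of the specified 'count'
// nodes, starting from depth 0 and adding each entry of 'changes' in turn.
std::vector<int> runningDepths(const int *changes, std::size_t count)
{
    std::vector<int> depths;
    int              depth = 0;
    for (std::size_t i = 0; i < count; ++i) {
        depth += changes[i];       // apply this node's adjustment
        depths.push_back(depth);   // record the depth at this node
    }
    return depths;
}
```

Under this scheme the text node inside the name element sits at depth 3, which agrees with the nodeDepth assertion in Example 1.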
Now, create a class that implements the balxml::Reader interface. Note that documentation for the class methods is omitted to keep the usage example short. If necessary, it can be found in the balxml::Reader class declaration.
                                // ================
                                // class TestReader
                                // ================

  class TestReader : public balxml::Reader {
    private:
      // DATA
      balxml::ErrorInfo    d_errorInfo;    // current error information

      balxml::PrefixStack *d_prefixes;     // prefix stack (held, not owned)

      XmlResolverFunctor   d_resolver;     // place holder, not actually used

      bool                 d_isOpen;       // flag indicating whether the
                                           // reader is open

      bsl::string          d_encoding;     // document encoding

      int                  d_nodeDepth;    // level of the current node

      const TestNode      *d_currentNode;  // node being handled (held, not
                                           // owned)

      // PRIVATE CLASS METHODS
      void setEncoding(const char *encoding);
      void adjustPrefixStack();

    public:
      // CREATORS
      TestReader();
      virtual ~TestReader();

      // MANIPULATORS
      virtual void setResolver(XmlResolverFunctor resolver);

      virtual void setPrefixStack(balxml::PrefixStack *prefixes);

      virtual int open(const char *filename, const char *encoding = 0);
      virtual int open(const char *buffer,
                       size_t      size,
                       const char *url = 0,
                       const char *encoding = 0);
      virtual int open(bsl::streambuf *stream,
                       const char     *url = 0,
                       const char     *encoding = 0);

      virtual void close();

      virtual int advanceToNextNode();

      virtual int lookupAttribute(balxml::ElementAttribute *attribute,
                                  int                       index) const;
      virtual int lookupAttribute(balxml::ElementAttribute *attribute,
                                  const char               *qname) const;
      virtual int lookupAttribute(
                               balxml::ElementAttribute *attribute,
                               const char               *localName,
                               const char               *namespaceUri) const;
      virtual int lookupAttribute(
                                balxml::ElementAttribute *attribute,
                                const char               *localName,
                                int                       namespaceId) const;

      virtual void setOptions(unsigned int flags);

      // ACCESSORS
      virtual const char *documentEncoding() const;
      virtual XmlResolverFunctor resolver() const;
      virtual bool isOpen() const;
      virtual const balxml::ErrorInfo& errorInfo() const;
      virtual int getLineNumber() const;
      virtual int getColumnNumber() const;
      virtual balxml::PrefixStack *prefixStack() const;
      virtual NodeType nodeType() const;
      virtual const char *nodeName() const;
      virtual const char *nodeLocalName() const;
      virtual const char *nodePrefix() const;
      virtual int nodeNamespaceId() const;
      virtual const char *nodeNamespaceUri() const;
      virtual const char *nodeBaseUri() const;
      virtual bool nodeHasValue() const;
      virtual const char *nodeValue() const;
      virtual int nodeDepth() const;
      virtual int numAttributes() const;
      virtual bool isEmptyElement() const;
      virtual unsigned int options() const;
  };

                                // ----------------
                                // class TestReader
                                // ----------------

  // PRIVATE CLASS METHODS
  inline
  void TestReader::setEncoding(const char *encoding)
  {
      d_encoding =
                 (0 == encoding || '\0' == encoding[0]) ? "UTF-8" : encoding;
  }

  inline
  void TestReader::adjustPrefixStack()
  {
      // Each time this object reads a 'e_NODE_TYPE_ELEMENT' node, it must
      // push a namespace prefix onto the prefix stack to handle in-scope
      // namespace calculations that happen inside XML documents where inner
      // namespaces can override outer ones.

      if (balxml::Reader::e_NODE_TYPE_ELEMENT == d_currentNode->d_type) {
          for (int ii = 0; ii < TestNode::k_NUM_ATTRIBUTES; ++ii) {
              const char *prefix = d_currentNode->d_attributes[ii].d_qname;

              if (!prefix || bsl::strncmp("xmlns", prefix, 5)) {
                  continue;
              }

              if (':' == prefix[5]) {
                  d_prefixes->pushPrefix(
                      prefix + 6, d_currentNode->d_attributes[ii].d_value);
              }
              else {
                  // default namespace
                  d_prefixes->pushPrefix(
                      "", d_currentNode->d_attributes[ii].d_value);
              }
          }
      }
      else if (balxml::Reader::e_NODE_TYPE_NONE == d_currentNode->d_type) {
          d_prefixes->reset();
      }
  }

  // PUBLIC CREATORS
  TestReader::TestReader()
  : d_errorInfo()
  , d_prefixes(0)
  , d_resolver()
  , d_isOpen(false)
  , d_encoding()
  , d_nodeDepth(0)
  , d_currentNode(0)
  {
  }

  TestReader::~TestReader()
  {
  }

  // MANIPULATORS
  void TestReader::setResolver(XmlResolverFunctor resolver)
  {
      d_resolver = resolver;
  }

  void TestReader::setPrefixStack(balxml::PrefixStack *prefixes)
  {
      assert(!d_isOpen);

      d_prefixes = prefixes;
  }

  int TestReader::open(const char * /* filename */,
                       const char * /* encoding */)
  {
      return -1;  // STUB
  }

  int TestReader::open(const char * /* buffer */,
                       size_t       /* size */,
                       const char * /* url */,
                       const char *encoding)
  {
      if (d_isOpen) {
          return -1;                                                // RETURN
      }
      d_isOpen    = true;
      d_nodeDepth = 0;
Note that we do not use the supplied buffer, but direct the internal iterator to the fake structure:
      d_currentNode = fakeDocument;

      setEncoding(encoding);
      return 0;
  }

  int TestReader::open(bsl::streambuf * /* stream */,
                       const char     * /* url */,
                       const char     * /* encoding */)
  {
      return -1;  // STUB
  }

  void TestReader::close()
  {
      if (d_prefixes) {
          d_prefixes->reset();
      }

      d_isOpen = false;
      d_encoding.clear();
      d_nodeDepth   = 0;
      d_currentNode = 0;
  }

  int TestReader::advanceToNextNode()
  {
      if (!d_currentNode) {
          return -1;                                                // RETURN
      }

      const TestNode *nextNode = d_currentNode + 1;

      if (balxml::Reader::e_NODE_TYPE_NONE == nextNode->d_type) {
          // The document ends when the type of the next node is
          // 'e_NODE_TYPE_NONE'.
          if (d_prefixes) {
              d_prefixes->reset();
          }
          return 1;                                                 // RETURN
      }

      d_currentNode = nextNode;

      if (d_prefixes && 1 == d_nodeDepth) {
          // A 'TestReader' only recognizes namespace URIs that have the
          // prefix "xmlns:" on the top-level element. A 'TestReader' adds
          // such URIs to its prefix stack. It treats namespace URI
          // declarations on any other elements like normal attributes, and
          // resets its prefix stack once the top level element closes.
          adjustPrefixStack();
      }

      d_nodeDepth += d_currentNode->d_depthChange;

      return 0;
  }

  int TestReader::lookupAttribute(balxml::ElementAttribute *attribute,
                                  int                       index) const
  {
      if (!d_currentNode ||
          index < 0 ||
          index >= TestNode::k_NUM_ATTRIBUTES) {
          return 1;                                                 // RETURN
      }

      const char *qname = d_currentNode->d_attributes[index].d_qname;
      if (!qname || '\0' == qname[0]) {
          return 1;                                                 // RETURN
      }

      attribute->reset(
          d_prefixes, qname, d_currentNode->d_attributes[index].d_value);
      return 0;
  }

  int TestReader::lookupAttribute(
                                balxml::ElementAttribute * /* attribute */,
                                const char               * /* qname */) const
  {
      return -1;  // STUB
  }

  int TestReader::lookupAttribute(
                         balxml::ElementAttribute * /* attribute */,
                         const char               * /* localName */,
                         const char               * /* namespaceUri */) const
  {
      return -1;  // STUB
  }

  int TestReader::lookupAttribute(
                          balxml::ElementAttribute * /* attribute */,
                          const char               * /* localName */,
                          int                        /* namespaceId */) const
  {
      return -1;  // STUB
  }

  void TestReader::setOptions(unsigned int /* flags */)
  {
      return;  // STUB
  }

  // ACCESSORS
  const char *TestReader::documentEncoding() const
  {
      return d_encoding.c_str();
  }

  TestReader::XmlResolverFunctor TestReader::resolver() const
  {
      return d_resolver;
  }

  bool TestReader::isOpen() const
  {
      return d_isOpen;
  }

  const balxml::ErrorInfo& TestReader::errorInfo() const
  {
      return d_errorInfo;
  }

  int TestReader::getLineNumber() const
  {
      return 0;  // STUB
  }

  int TestReader::getColumnNumber() const
  {
      return 0;  // STUB
  }

  balxml::PrefixStack *TestReader::prefixStack() const
  {
      return d_prefixes;
  }

  TestReader::NodeType TestReader::nodeType() const
  {
      if (!d_currentNode || !d_isOpen) {
          return e_NODE_TYPE_NONE;                                  // RETURN
      }

      return d_currentNode->d_type;
  }

  const char *TestReader::nodeName() const
  {
      if (!d_currentNode || !d_isOpen) {
          return 0;                                                 // RETURN
      }

      return d_currentNode->d_qname;
  }

  const char *TestReader::nodeLocalName() const
  {
      if (!d_currentNode || !d_isOpen) {
          return 0;                                                 // RETURN
      }

      // This simple 'TestReader' does not understand XML that contains
      // qualified node names. This means the local name of a node is always
      // equal to its qualified name, so this function simply returns
      // 'd_qname'.
      return d_currentNode->d_qname;
  }

  const char *TestReader::nodePrefix() const
  {
      return "";  // STUB
  }

  int TestReader::nodeNamespaceId() const
  {
      return -1;  // STUB
  }

  const char *TestReader::nodeNamespaceUri() const
  {
      return "";  // STUB
  }

  const char *TestReader::nodeBaseUri() const
  {
      return "";  // STUB
  }

  bool TestReader::nodeHasValue() const
  {
      if (!d_currentNode || !d_isOpen) {
          return false;                                             // RETURN
      }

      if (0 == d_currentNode->d_nodeValue) {
          return false;                                             // RETURN
      }

      return ('\0' != d_currentNode->d_nodeValue[0]);
  }

  const char *TestReader::nodeValue() const
  {
      if (!d_currentNode || !d_isOpen) {
          return 0;                                                 // RETURN
      }

      return d_currentNode->d_nodeValue;
  }

  int TestReader::nodeDepth() const
  {
      return d_nodeDepth;
  }

  int TestReader::numAttributes() const
  {
      if (!d_currentNode || !d_isOpen) {
          return 0;                                                 // RETURN
      }

      for (int index = 0; index < TestNode::k_NUM_ATTRIBUTES; ++index) {
          if (0 == d_currentNode->d_attributes[index].d_qname) {
              return index;                                         // RETURN
          }
      }

      return TestNode::k_NUM_ATTRIBUTES;
  }

  bool TestReader::isEmptyElement() const
  {
      if (!d_currentNode || !d_isOpen) {
          return false;                                             // RETURN
      }

      return d_currentNode->d_isEmpty;
  }

  unsigned int TestReader::options() const
  {
      return 0;
  }
Finally, our implementation of balxml::Reader is complete. We may use this implementation as the TestReader in the first example.