Saturday, June 18, 2016

Data Serialization

A data serialization standard is required any time you need to represent data outside of a program.

Two of the most popular data serialization standards are XML and JSON.  Both are human-readable formats, a feature that has undoubtedly contributed to their popularity.

XML is used heavily in enterprise contexts, although JSON currently seems to be gaining ground on XML in this space as well.  XML is a document markup language and was not originally designed for data serialization.  Without a schema such as XSD or external type encoding rules, XML documents do not innately store data type information.
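To illustrate the point about missing type information, here is a minimal Python sketch using only the standard library.  The document and its element names (person, age, isAlive) are made up for this example, and the conversion rules at the end are just one assumption about what an application might agree on out of band.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; the element names are chosen for illustration only.
doc = "<person><age>25</age><isAlive>true</isAlive></person>"

root = ET.fromstring(doc)
age = root.find("age").text
is_alive = root.find("isAlive").text

# Both values come back as plain strings; XML itself carries no type info.
print(type(age).__name__, repr(age))
print(type(is_alive).__name__, repr(is_alive))

# The consumer has to apply its own conversion rules (assumed here).
typed = {"age": int(age), "isAlive": is_alive == "true"}
print(typed)
```

Without a schema or an agreed convention, nothing in the document itself says that age is a number or that isAlive is a boolean.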

The SOAP protocol is based on XSDs; it is a very complex protocol with a hierarchy of complex specifications.  SOAP originally stood for Simple Object Access Protocol, but this acronym is no longer used as of SOAP 1.2.   SOAP ticks a lot of boxes that make enterprise architects happy, such as:
  • Transport independence
  • Distributed processing model
  • Design by contract (interface specification separate from the implementation)
  • Synchronous and asynchronous messaging
  • Universal standard
In my opinion, there is nothing simple about SOAP.  Its status as a "universal standard" is also questionable in practice, because the complexity of the specifications has led to incompatibilities between different vendors' SOAP implementations.

I once had a conversation with an engineer from a global tier-1 integration software company (which will remain unnamed).  He told me, as proof of how good and how standards-compliant his company's SOAP solution was, that they had recently won a lawsuit: one of their clients had sued them for negligence, alleging that their SOAP implementation was incompatible after a critical interface or set of interfaces could not be made to work.  According to this engineer, the company was able to prove in court that its SOAP implementation was correct, and it won the suit.  The fact that the issue had to go to court at all is a testament to the complexity of SOAP.  Taking pride in winning a lawsuit brought by your own customer because you couldn't make an interface work (basically your only purpose as an integration ISV) is another subject better left alone.

One simple extension of XML to support data typing is XML-RPC, which defines several data types, including lists (array) and key-value pairs (struct) that combine the simpler types into arbitrarily complex data structures.  It's straightforward and simple: easy to read, and easy to serialize and deserialize.  It doesn't have the list of enterprise features that SOAP does, but to me it looks like a simple protocol designed by a single engineer, whereas SOAP looks like a typical protocol "designed by committee", where good engineering ("keep it simple, stupid") takes a back seat to politics, and complexity is not avoided but rather embraced (a great way to destroy a technology or project in my experience!).

Nowadays you see more and more JSON.  JSON is an elegant data serialization format that was designed for the purpose it's used for.  JSON is human-readable and is derived from the object literal syntax of JavaScript, a language that's grown in popularity along with the web.

For example, here is a JSON string:
{
 "firstName": "John",
 "lastName": "Smith",
 "isAlive": true,
 "age": 25,
 "address": {
   "streetAddress": "21 2nd Street",
   "city": "New York",
   "state": "NY",
   "postalCode": "10021-3100"
 }
}
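
As a rough sketch of what this means in practice, the following Python snippet parses the JSON string above with the standard json module and prints the type each value comes back as.  Python is used here only for illustration; the variable names are arbitrary.

```python
import json

text = """
{
 "firstName": "John",
 "lastName": "Smith",
 "isAlive": true,
 "age": 25,
 "address": {
   "streetAddress": "21 2nd Street",
   "city": "New York",
   "state": "NY",
   "postalCode": "10021-3100"
 }
}
"""

data = json.loads(text)

# Each value is deserialized to a matching native type without any schema.
for key, value in data.items():
    print(key, type(value).__name__)

# Serializing and parsing again yields an equal data structure.
round_tripped = json.loads(json.dumps(data))
print(round_tripped == data)
```

The string, boolean, integer, and nested object values are each recovered as distinct types, which is the property plain XML lacks without a schema.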

The same data structure with XML-RPC encoding looks like:
<struct>
  <member>
    <name>firstName</name>
    <value>
      <string>John</string>
    </value>
  </member>
  <member>
    <name>lastName</name>
    <value>
      <string>Smith</string>
    </value>
  </member>
  <member>
    <name>isAlive</name>
    <value>
      <boolean>1</boolean>
    </value>
  </member>
  <member>
    <name>age</name>
    <value>
      <i4>25</i4>
    </value>
  </member>
  <member>
    <name>address</name>
    <value>
      <struct>
        <member>
          <name>streetAddress</name>
          <value>
            <string>21 2nd Street</string>
          </value>
        </member>
        <member>
          <name>city</name>
          <value>
            <string>New York</string>
          </value>
        </member>
        <member>
          <name>state</name>
          <value>
            <string>NY</string>
          </value>
        </member>
        <member>
          <name>postalCode</name>
          <value>
            <string>10021-3100</string>
          </value>
        </member>
      </struct>
    </value>
  </member>
</struct>
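
For comparison, here is a small Python sketch that produces a similar XML-RPC encoding with the standard xmlrpc.client module and reads it back.  Note that Python's marshaller wraps the value in a params element and writes integers as int rather than i4, so its output is not byte-for-byte identical to the listing above; the size comparison with JSON at the end is only indicative.

```python
import json
import xmlrpc.client

data = {
    "firstName": "John",
    "lastName": "Smith",
    "isAlive": True,
    "age": 25,
    "address": {
        "streetAddress": "21 2nd Street",
        "city": "New York",
        "state": "NY",
        "postalCode": "10021-3100",
    },
}

# dumps() expects a tuple of parameters; here it is a single struct.
xml_text = xmlrpc.client.dumps((data,))

# loads() returns (params, method_name); method_name is None for a bare fragment.
params, method_name = xmlrpc.client.loads(xml_text)
decoded = params[0]

print(decoded == data)            # types survive the round trip
print(type(decoded["isAlive"]).__name__, type(decoded["age"]).__name__)

# Rough size comparison of the two encodings of the same data.
json_text = json.dumps(data)
print(len(xml_text), len(json_text))
```

The round trip preserves the boolean and integer types, at the cost of a considerably longer encoding than the JSON equivalent.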

Because it was designed for data serialization, JSON is better suited to this task than XML, even when XML is extended with simple data typing as in XML-RPC.

Even better than JSON, but not as popular, is YAML (the subject of another blog post some time ago).  YAML 1.2 is backwards-compatible with JSON, and is also extensible.  With YAML you get an elegant, human-readable data serialization standard that can be extended to support application or platform types.

The above data structure looks as follows with YAML (using block style for formatting):

firstName: "John"
lastName: "Smith"
isAlive: true
age: 25
address:
  streetAddress: "21 2nd Street"
  city: "New York"
  state: "NY"
  postalCode: "10021-3100"

As with JSON, despite the lack of explicit type information, types are unambiguous and are preserved in the YAML string above.

YAML is extremely suitable for data serialization and therefore also for use in data exchange protocols.

Qore supports XML, JSON, and YAML for data serialization, but YAML is the preferred mechanism whenever it can be used.

Qore's YAML support is provided by the yaml module.  Qore implements a YAML-RPC protocol, which is basically JSON-RPC 1.1 but using YAML for data serialization.  Additionally, the DataStream protocol is a point-to-point protocol using YAML serialization for sending large volumes of data over HTTP with a small memory footprint.

From an engineering perspective, YAML is simple, elegant, extensible, and does a very important job very well.

While there are compelling arguments to be made for binary data serialization standards in some contexts, when you need an elegant, human-readable data serialization approach, I would highly recommend taking a look at YAML.