The importance of schemas, where’s that XSD file?

I’m coming across more and more projects that make heavy use of xml for data integration or data exchange that lack of schema files.

First case:
My team was POC-ing the integration with a third party web service. I won’t mention names but this company is one of the main sources of collected data for rating. After a purchased subscription for monthly text files with data dumps we are also able to access their web service. This web service provides a restricted interface to the data available on the text files. The service call is an http request with some parameters passed via POST (I would say it looks REST-ful). There is no WSDL published and the service returns an XML.
The main problem here is that the documentation on the XML is a PDF, yes a PDF! that will give you examples of the XMLs obtained in the response. Yes, samples with real data and a data dictionary, for instance the Product Element means XYZ, the codeA element means DFG.
All that amount of documentation looks fancy for the untrained eye, and most business analysts look puzzle when we say the documentation is not complete… why? There is a 200 pages PDF there!!!
Yes, but is there a schema that will validate all possible combination of XML responses? No.
Will the service consumer be prepared for all possible combination of XML responses? with luck and a lot of trial and error…

From W3schools:

The purpose of an XML Schema is to define the legal building blocks of an XML document, just like a DTD.

An XML Schema:

* defines elements that can appear in a document
* defines attributes that can appear in a document
* defines which elements are child elements
* defines the order of child elements
* defines the number of child elements
* defines whether an element is empty or can include text
* defines data types for elements and attributes
* defines default and fixed values for elements and attributes

Second case:
The second time this month I came across a similar xsd-less approach was in a data exchange project. The project consisted of extracting data from two databases, building a huge xml file and applying transformations to that file to comply with CSIO standards. The transformed XML would be consumed by several external applications.
After seeing a few of the initial xml files with with the data dumps I asked, where’s the schema definition? and the answer was, there is none… Apparently the the development of the transformation engine had been contracted to a third party company that had advised the tweaks to be made on the xml to be able to transform into an AL3 file. All by trial and error…

Third case:
This case was actually some years ago. I recall suffering the consequences of an xsd-less approach when I was in charge of uploading data batches into a database to power an online quoting tool.
The XML file was provided periodically by a third party company and registered the sale prices of pre-owned models produced by that manufacturer. The data exchange wasn’t fancy, they would send me the 2GB+ xml file by secured email.
I was in charge of parsing that DOM, transform it to a schema the online tool could consume, get the differential data and dump it (via a SQL Server DTS) into the database the online tool will be fed from.
As it was a .NET project I used XPathNavigator heavily to avoid loading all the DOM in memory with an XmlDocument, my machine at the time had 512 MB.
The first two attempts worked fine, but the subsequent data batches were not so straightforward. My XPath expressions kept failing. Why?
The differences were minimal, anyone that hasn’t parsed XML or deserialize XML will consider them nuances:
One child node was before another one while in the previous document they were in reverse order, one piece of data that was before in the text area of an xml element was now an attribute of the parent element . Some elements were null sometimes, while sometimes they were completely omitted from the file and the list goes on.
I learned the lesson the hard way, we told the other end, no, we need to agree upon a schema and if the file you sent us cannot be validated against the schema we won’t acknowledge the reception…
I still jitter when I see XML exchange without an XSD…but it seems to be a recurrent theme.

3 thoughts on “The importance of schemas, where’s that XSD file?”

  1. @dontcare

    Thanks for pointing out VTD-XML. To be completely honest I haven't read much about it, will do now.
    It seems to me a new way to index XML documents that don't rely on loading the entire DOM in memory but building an index of it in memory. It uses XPath to create these indexes, yet in order to build the Xpath expression you should somehow comply with certain schema? no?
    What happens when one of the changes on my xml document is putting the data from an element on an attribute? My original Xpath expression won't get this data right…what am I missing from vtd-xml?

  2. It loads the data in memory, into VTD records, instead of DOM nodes, which is a lot more efficient in memory usage… it doesn't use XPath to create indexes, it actually allows xpath to be evaluated over the index… it is schemaless, so conformance should not be an issue here

Leave a Reply

Your email address will not be published. Required fields are marked *


This site uses Akismet to reduce spam. Learn how your comment data is processed.