Towards the industrialization of XML processing
Last updated Saturday, December 08, 2001
As XML proliferates, so too does the need for scalable, robust XML processing. Current approaches to XML processing are varied with a high degree of overlap. There is no consensus as to how best to process XML and techniques with names like SAX, DOM, XSLT all have their fervent admirers and detractors, their pros and their cons.
XPipe is a methodology for XML processing that steps one degree above individual processing techniques in order to blend them into a single XML processing approach. Using the assembly line principle familiar from the world of physical manufacturing, XPipe promotes the decomposition of large, complex tasks into smaller tasks to be performed in sequence.
XPipe is a methodology for the development of industrial strength XML processing systems.
XPipe.net (will soon be) the home of an Open Source project that implements the XPipe methodology. The source code is hosted on Sourceforge (http://xpipe.sourceforge.net).
The world of XML processing in 2001 could most charitably be described as being at a pre-consensus stage.
There is lots and lots of XML out there. An increasing amount of it fulfils mission critical roles in organizations of all shapes and sizes in both public and private sectors.
XML processing is performed with an eclectic mix of standards, tools and paradigms adding a bewildering array of acronyms into the already burgeoning collection: XSLT, DOM, SAX, Trex, Relax, Schematron, XPath. There is also a wide variety of programming languages used including Java, Python, Perl, Ruby, Tcl, Omnimark, C# and so on.
The growth in popularity of XML and related technologies has been matched by a growth in the complexity of XML processing.
The complexity and density of XML processing has increased to the point where normal IT folk cannot wrap their heads around it and also be productive in their day jobs.
The complexity of the XML processing that will be demanded by emerging applications will stress existing paradigms beyond breaking point.
We need to move beyond textbook examples of XML processing into robust, industrial strength solutions that can process terabytes of data unattended, function 24x7, and be developed by engineering teams on time and within budget.
In short, we need to move XML processing—currently considered something of an art form—into a science.
Markup geeks (a fraternity I am proud to be part of) need to balance their love of the intellectual playground that is markup, with a willingness to help non-geeks productively use markup without dedicating significant work-cycles to it.
As the biologist Edward O. Wilson puts it:
“The love of complexity without reductionism makes art. The love of complexity with reductionism makes science.”
We need to tame the complexity of XML processing and reduce it to manageable component pieces that ordinary folk can work with. We need to commoditize the complexities by hiding them behind black boxes that can be wielded without regard to their contents.
In the best traditions of software engineering, we can usefully look around us to other areas of engineering in search of ideas for how best to commoditize XML processing.
XML is ultimately a complex hierarchical structure, created from raw materials (characters). XML processing consists of creating these hierarchical structures and transforming them from one shape to another.
Where else in the world of manufacturing can we find examples of complex, hierarchical structures manipulated in a way that tames the complexity of the underlying forms?
Examples are easy to come by. The figure below is a photo from around 1914 of the Ford Motor Company manufacturing system for Model T Fords.
The figure below shows some cels from Walt Disney’s Oswald the Lucky Rabbit dating from 1927.
The figure below shows a sandwich under preparation at a Subway store.
Here we see three different examples of complex processing that share the same approach to industrial strength manufacturing—the assembly line.
Although Henry Ford was the first to apply assembly line principles to car manufacturing in 1908, the principles can be traced back to Adam Smith’s “An Inquiry into the Nature and Causes of the Wealth of Nations” dating from 1776. Smith discusses the advantages in pin making to be gained by decomposing the manufacturing operation into a number of discrete steps to be performed by individual workers.
Walt Disney’s largely forgotten Oswald the Lucky Rabbit, dating from 1927, was the first cartoon animation to be produced using assembly line techniques.
The assembly line approach is so familiar from our everyday experience that we take it for granted.
The assembly line principle is not new to computing. Unix, for example, makes significant use of the pipe concept to allow independent programs to communicate via well defined inputs and outputs (see below).
The Unix pipe concept is a very powerful one. It is supported in Unix command shells by the “|” vertical bar providing a very simple way to create pipelines with a built-in “bounded buffer” scheduling system. The user can invoke the pipeline process depicted in figure 4 with the command:
A | B | C
However, the Unix pipe concept is significantly hindered by its reliance on a shared understanding of data as a collection of discrete lines, each terminated by the newline character. The unit of processing for pipelined applications is generally a line of text. Examples include the common utilities grep, more, ls and so on.
For data that fits easily into a line structure with a simple sub-structure (e.g. fixed width, whitespace delimited, colon delimited) this works very well, but for more complex structures it does not.
Before the days of XML, plain text with simple delimiters was in effect the highest universally accepted level of data interchange. Beyond that, developers relied on specific programming languages such as C to create data structures and corresponding APIs for accessing those data structures. This resulted in a class of applications that cannot be chained together into ad-hoc pipelines.
XML changes this situation. By significantly raising the bar of universally accepted data interchange to arbitrarily rich, hierarchical structure encoded in a plain text stream, XML allows us to revisit the powerful Unix concept of “pipe” and extend its applicability into the realm of the richer data structures that can be encoded in XML.
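One way to picture the extended pipe is as a chain of stages, each of which accepts an XML instance and returns an XML instance. The following Python fragment is a minimal sketch under that assumption, not the XPipe implementation; the stage names, the `pipe` helper and the sample document are all invented for illustration.

```python
import xml.etree.ElementTree as ET
from functools import reduce

def uppercase_titles(xml_text):
    """Hypothetical stage: upper-case the text of every <title> element."""
    root = ET.fromstring(xml_text)
    for title in root.iter("title"):
        title.text = (title.text or "").upper()
    return ET.tostring(root, encoding="unicode")

def number_items(xml_text):
    """Hypothetical stage: add a sequential 'n' attribute to every <item>."""
    root = ET.fromstring(xml_text)
    for n, item in enumerate(root.iter("item"), start=1):
        item.set("n", str(n))
    return ET.tostring(root, encoding="unicode")

def pipe(*stages):
    """Compose stages left to right, in the spirit of A | B | C in a shell."""
    return lambda xml_text: reduce(lambda doc, stage: stage(doc), stages, xml_text)

pipeline = pipe(uppercase_titles, number_items)
result = pipeline(
    "<list><item><title>one</title></item><item><title>two</title></item></list>"
)
```

Because `pipe` returns something with the same XML in, XML out shape as a single stage, a whole pipeline can itself be dropped into a larger one.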
The assembly line approach to manufacturing has many benefits to offer XML processing. All the reasons for using it in the world of physical manufacturing apply with some extra advantages that accrue from the lack of fixed physical boundaries that is characteristic of software systems.
Input/output predicates and design by contract
Stages have their own input-output predicates – human and machine readable. Unit testing.
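As a rough illustration of what machine readable input and output predicates might look like, the sketch below wraps a stage so that both predicates are checked on every call. The `contract` decorator, the predicates and the order document are assumptions made up for this example; a real system might equally express the predicates as schemas or Schematron rules.

```python
import xml.etree.ElementTree as ET

def contract(pre, post):
    """Wrap a stage so its input and output predicates are checked on every call."""
    def decorate(stage):
        def checked(xml_text):
            if not pre(ET.fromstring(xml_text)):
                raise ValueError(f"{stage.__name__}: input predicate failed")
            out = stage(xml_text)
            if not post(ET.fromstring(out)):
                raise ValueError(f"{stage.__name__}: output predicate failed")
            return out
        checked.__name__ = stage.__name__
        return checked
    return decorate

def has_prices(root):
    """Input predicate: every <item> carries a <price>."""
    return all(item.find("price") is not None for item in root.iter("item"))

def has_total(root):
    """Output predicate: the document has a <total> child."""
    return root.find("total") is not None

@contract(pre=has_prices, post=has_total)
def add_total(xml_text):
    """Hypothetical stage: append a <total> summing all <price> values."""
    root = ET.fromstring(xml_text)
    total = sum(float(p.text) for p in root.iter("price"))
    ET.SubElement(root, "total").text = f"{total:.2f}"
    return ET.tostring(root, encoding="unicode")
```

The same predicates double as unit tests: feed a stage a document that satisfies the input predicate and assert that the output predicate holds.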
Fundamental Transformation Operations
There are only so many ways to chop a tree – get down to fundamental operations. Re-usability very achievable. Rubik’s cube. 4 x 10^19 states. Hofstadter has identified seven fundamental operations. Also, at most 50 operations needed to get from any start state to the finish state.
Knot theory, Reidemeister moves. Turing completeness. Or, more pragmatically, a CPU with a few hundred op codes that can do everything.
Use whatever algorithm, data structure fits the bill on a component by component basis rather than implement all in the same paradigm. SAX/DOM/Lexical/PSVI etc.
Transformation decomposition. Get multiple developers working in parallel very easily.
Rapid Fault Isolation
Log(n) fault location.
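One reading of this note: if the output of every stage is kept, and a fault, once introduced, is still visible in every later output, then a binary search over the saved outputs finds the offending stage in roughly log2(n) predicate checks. The sketch below assumes exactly that; `first_faulty_stage` and the well-formedness predicate are illustrative names, not part of XPipe.

```python
import xml.etree.ElementTree as ET

def well_formed(xml_text):
    """Example predicate: does the text parse as XML?"""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

def first_faulty_stage(outputs, is_ok):
    """Return the index of the first stage whose saved output fails is_ok.

    Assumes outputs[i] is the saved output of stage i, and that once a
    stage breaks the document every later output is broken too.
    Returns None when the final output passes.
    """
    if not outputs or is_ok(outputs[-1]):
        return None
    lo, hi = 0, len(outputs) - 1      # invariant: outputs[hi] is known bad
    while lo < hi:
        mid = (lo + hi) // 2
        if is_ok(outputs[mid]):
            lo = mid + 1              # fault was introduced after mid
        else:
            hi = mid                  # fault is at mid or earlier
    return lo
```

If a later stage can mask an earlier fault, the monotonicity assumption fails and a linear scan over the outputs is the safe fallback.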
Metrics and monitoring.
Flexibility is an unavoidable byproduct of loose coupling. Easy to change around compared to monolithic systems.
The XPipe Bootstrap
XPipe is essentially a C-Activity “bootstrap” in the sense coined by Doug Engelbart (http://www.bootstrap.org).
- A-Activity – what an organization does
- B-Activity – process improvement i.e. use of XML
- C-Activity – improve the process improvement
XPipe fits the description of a technology aimed at boosting “mankind’s collective capability for coping with complex, urgent problems.”
Highly Scalable via massive parallelism
Most real world XML processing amenable to domain decomposition – trivially parallelizable
- Schema level validation
- Business rule validation (Schematron, Trex, Regular Fragmentations) think of them as XML in XML out
- XML Metrics – tag share – key to where your developers spend their time. DTDs not statistical. Zipf’s law applies.
- Legacy – wrap in xml in/xml out and existing systems can be componentised within reason.
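To make the domain decomposition point concrete, the sketch below computes the tag share metric over a collection by handing each document to a separate worker and then merging the per-document counts. Everything here (function names, sample documents, worker count) is assumed for illustration. Threads keep the example self-contained; for CPU-bound parsing in CPython, separate processes or separate machines would be the more realistic choice.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def tag_counts(xml_text):
    """Count how often each element name occurs in one document."""
    return Counter(el.tag for el in ET.fromstring(xml_text).iter())

def tag_share(documents, workers=4):
    """Fraction of all elements accounted for by each tag, across a collection."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each document is independent, so the per-document work can run anywhere.
        totals = sum(pool.map(tag_counts, documents), Counter())
    n = sum(totals.values())
    return {tag: count / n for tag, count in totals.most_common()}

docs = ["<doc><p/><p/><p/></doc>", "<doc><p/><note/></doc>"]
share = tag_share(docs)
```

No document depends on any other, so adding workers raises throughput without changing the result.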
Pipes Versus Graphs
Pipe as degenerate component of graph. As soon as you add a tag for “<IF>” you are sliding down that slippery slope.
Curves are a series of straight lines. Graphs are straight lines joined together. Components for <IF> etc. to create flows.
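A flow component in the spirit of `<IF>` could be sketched as a stage that inspects the document and hands it to one of two sub-pipes. The `branch` helper, the `priority` attribute and the route names below are hypothetical, chosen only to show the shape of the idea.

```python
import xml.etree.ElementTree as ET

def branch(test, if_true, if_false):
    """Build a component that routes a document to one of two sub-pipes."""
    def component(xml_text):
        chosen = if_true if test(ET.fromstring(xml_text)) else if_false
        return chosen(xml_text)
    return component

def is_urgent(root):
    """Hypothetical routing test on the root element."""
    return root.get("priority") == "high"

def tag_with(route):
    """Hypothetical stage factory: record which route a document took."""
    def stage(xml_text):
        root = ET.fromstring(xml_text)
        root.set("route", route)
        return ET.tostring(root, encoding="unicode")
    return stage

router = branch(is_urgent, tag_with("fast"), tag_with("batch"))
```

The router still presents the XML in, XML out interface, so from the outside the graph looks like one more straight segment of pipe.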
Speed is a three headed monster
Speed of execution, speed of development, speed of maintenance.
Get it to work before you get it to work fast. I have great faith in the ingenuity of the development community to take something that is simple, functional and slow and turn it into something that remains simple, functional and fast enough.
Ties processing to process model – data interchange via API calls. Makes life difficult for distributing the processing load.
Forces the hand of developers in terms of abstraction to use
Infoset problems – XML is just syntax – roundtrip difficulties
Support all. The Universal Component Interface is XML instance in, XML instance out.
XSLT is not, despite efforts to market it otherwise, a general purpose XML processing technology.
Parallel execution does not help because the stages in the pipe need to be executed in order
Speed of execution is interesting as developers tend to think of it from a von Neumann architecture world view, i.e. if it takes N seconds to process 1 XML document it will take 100N seconds to process 100 documents. That assumes linear processing. With XPipe, any number of processors can be used simultaneously. The elapsed time for processing any one document does not increase (in fact it may decrease); what shoots up is throughput. Think documents per second, not seconds per document. How many organizations do you know with small collections of XML? XML comes into its own for large volumes: domains where throughput is more important than per-unit processing time.
Note this is one area where I think XSLT got the mix wrong.
Come back data flow diagrams – all is forgiven
Data flow diagrams (DFDs) were pioneered by Gane and Sarson and by De Marco in the Seventies. Data flow diagramming was an approach to systems analysis that focused on how data flows and is transformed in order to achieve some system goal.
The example in figure X is based on an example from Gane and Sarson in 19XX. What is remarkable about it is how, well, 21st century it looks if you think of the processes as Web services and the data flows as being XML documents.
Software is about how things change with respect to time. To model systems, model how things change. M.A. Jackson.
(Data flow diagrams from the Seventies. Now with XML as the data and DTDs/Schemas for controlling the structure of the data flow.)
Even crappy XML is better than non-XML. RDB, RTF etc.: get ’em into XML and then get tools at ’em.
XPipes are clearly web services.
SOAP is only the start of Web services. In fact, it is not even the interesting bit. Web services so far are all about end-points. That just duplicates a failed model: distributed code versus distributed data.
Example of addition as an xml-xml transformation.
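The example intended here is not given in the text. As one possible illustration, addition can be written as a stage that takes an XML instance holding the operands and returns an XML instance holding the result. The `<add>`, `<n>` and `<sum>` element names are invented for this sketch.

```python
import xml.etree.ElementTree as ET

def add(xml_text):
    """Turn <add><n>2</n><n>3</n></add> into <sum>5</sum>."""
    root = ET.fromstring(xml_text)
    total = sum(int(n.text) for n in root.findall("n"))
    result = ET.Element("sum")
    result.text = str(total)
    return ET.tostring(result, encoding="unicode")

answer = add("<add><n>2</n><n>3</n></add>")
```

Seen this way, even a trivial computation is a service with documents at both ends rather than an API call with typed arguments.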
Message queues have an important part to play. In effect, the Unix "pipe" facility created message queues between separate processes. To extend the pipe concept to the Web, the Web will need queues and will need to support asynchronous request/response. Can implement asynchronous request/response on top of HTTP.
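The relationship between pipes and queues can be sketched within a single process: each stage runs in its own thread and the stages are joined by bounded queues, so a fast producer blocks when the buffer is full, much as it would on a Unix pipe. This is an illustrative toy with assumed names (`run_stage`, `run_pipeline`, `DONE`) and no error handling; on the Web the queues would be network services rather than in-memory objects.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of the document stream

def run_stage(stage, inbox, outbox):
    """Pull documents from inbox, apply the stage, push results to outbox."""
    while True:
        doc = inbox.get()
        if doc is DONE:
            outbox.put(DONE)   # pass the end-of-stream marker downstream
            return
        outbox.put(stage(doc))

def run_pipeline(stages, documents, buffer_size=8):
    """Connect stages with bounded queues and run each in its own thread."""
    queues = [queue.Queue(maxsize=buffer_size) for _ in range(len(stages) + 1)]

    def feed():
        for doc in documents:
            queues[0].put(doc)  # blocks while the first buffer is full
        queues[0].put(DONE)

    threads = [threading.Thread(target=feed)]
    threads += [
        threading.Thread(target=run_stage, args=(stage, queues[i], queues[i + 1]))
        for i, stage in enumerate(stages)
    ]
    for t in threads:
        t.start()

    results = []
    while True:
        doc = queues[-1].get()
        if doc is DONE:
            break
        results.append(doc)
    for t in threads:
        t.join()
    return results
```

Each stage handles documents in arrival order, so results come out in the order they went in, while different stages work on different documents at the same time.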
JMS could prove as vital here as SAX.
Inbetweening – by analogy with animation industry
Remove IO – turn pipe into a component. Nicked this from a technique known as program inversion invented by Michael Jackson.
Lots to learn from physical manufacturing: Job Shop Scheduling, Bucket Brigade. Metrics, feedback etc.
Thinking about the future of industrialized XML processing
A worldwide grid of XML processing nodes – intergrid
A company wide grid of XML processing nodes – intragrid
A communally owned grid – extragrid
TSPs – XML service providers. Provide access to their grid for volume based fees.
Fractal distribution patterns of utilities such as water, electricity, aeroplanes etc. All share the same architecture. Thinking of information and information flow as a utility like any other is interesting. Create information flows so that the info is “close” to where it needs to be when it is needed. At semantic bifurcation points, information gets more and more attuned to the needs of the target audiences. Development of information silos in a fractal pattern.
Taxonomy of components and pipes
Taxonomy of XTLs (XML Transforming Languages)
XML round-tripping and data model issues.
Edward O. Wilson, Consilience: The Unity of Knowledge, Vintage, 1999, 0-679-45077-7, page 59.
Chris Gane and Trish Sarson, Structured Systems Analysis (Prentice-Hall, 1979).
Tom DeMarco, Structured Analysis and System Specification (Yourdon Inc., 1978).