Towards the industrialization of XML processing

Last updated Tuesday, December 18, 2001

XPipe Presentation

XPipe - An XML Processing Methodology

XML 2001 Florida, USA
December 13 2001

Sean McGrath

What is XPipe?

  • It is an architecture / methodology / framework for developing robust, scalable, manageable XML processing systems.
  • Based on proven mechanical manufacturing techniques. Specifically:
    • The Assembly Line Principle
    • Component assembly and component re-use
  • An open source project hosted on Sourceforge (http://xpipe.sourceforge.net)
  • A contribution to the blossoming meme of using pipeline based processing to tame the burgeoning complexity of XML transformations

(If you do not find XML transformation complicated, you are not sufficiently well informed.)

(And no, XSLT does not solve all your problems!)

Contents of this talk

  • The XPipe philosophy
  • Major functional elements
  • Some examples
  • Relationship to other technologies
  • The XGrid
  • Some anticipated objections (and answers)
  • Current status
  • Current problems
  • Future plans

The XPipe Philosophy

Cars are complex, hierarchical structures

 Henry Ford’s Model T Ford Assembly Line – 1914


Lunch is a complex hierarchical structure
 Lunch under construction at a Subway store


We are complex, hierarchical structures created on assembly lines.

Human tendon showing complex hierarchical structure

What have these scenes got in common?

  • Complex construction of cars, tuna melts and tendons made possible and efficient through assembly line manufacturing
  • re-usable component processes and component materials
  • Why not apply this approach to XML “manufacturing”?

Why does the assembly line approach work?

XPipe philosophy

  • A lot of data processing will consist of XML to XML transformation
  • A lot of non-XML data processing can consist of XML to XML transformations with the addition of top and tail transformations:-
    XML to XML transformation with possible non-XML start and end-points

  • Mantra
    • Get data into XML as quickly as possible
    • Keep it in XML until the last possible minute
    • Bring all your XML tools to bear on solving the data processing problem
  • The philosophy hinges on the fact that every complex XML transformation can be broken down into a series of smaller ones that can be chained together:-
Any complex XML transformation is a series of smaller, less complex transformations chained together
  • There are only so many ways to re-arrange an XML tree structure. Consider Rubik’s Cube - a complex transformation to solve, but there are only a certain number of fundamental transformations involved
A complex transformation made up of a finite number of fundamental transformations


  • A finite number of fundamental transformations, from which all higher order transformations can be derived
  • Transformation Decomposition leads to:
    • A series of small, manageable, “stand alone” problems, each with an XML input “spec” and an XML output “spec”.
    • Can build, test, use and then re-use these transformation components
    • Very team development friendly
    • High cohesion, loose coupling – just like the professor advised
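
A minimal sketch of the decomposition idea, in Python with the standard library only. It is not XPipe code: the helper names (chain, lowercase_tags, set_attribute) are hypothetical, and the assumption that every stage maps an XML string to an XML string is made here purely for illustration.

```python
from functools import reduce
import xml.etree.ElementTree as ET


def chain(*stages):
    """Compose stages left to right into a single transformation."""
    def run(xml_text):
        return reduce(lambda doc, stage: stage(doc), stages, xml_text)
    return run


def lowercase_tags(xml_text):
    """Small stage: lower-case every element type name."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        element.tag = element.tag.lower()
    return ET.tostring(root, encoding="unicode")


def set_attribute(tag, name, value):
    """Build a small stage that stamps an attribute on every <tag> element."""
    def stage(xml_text):
        root = ET.fromstring(xml_text)
        for element in root.iter(tag):
            element.set(name, value)
        return ET.tostring(root, encoding="unicode")
    return stage


# Two small, separately testable stages chained into one larger transformation.
pipe = chain(lowercase_tags, set_attribute("bar", "seen", "yes"))
result = pipe("<FOO><Bar>baz</Bar></FOO>")
print(result)  # <foo><bar seen="yes">baz</bar></foo>
```

Each stage can be built, tested and re-used on its own; the chain only needs to know that XML goes in and XML comes out.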

More XPipe philosophy

  • Pipeline approach means you can mix ‘n’ match black-box components that internally use whatever paradigm best suits the problem
    • Lexical
    • SAX
    • DOM
    • XSLT
    • XDuce, Pyxie, Haskell…
Stages in an XPipe can use whatever paradigm best suits the problem at hand
  • Assertion : developers would use a component based approach to XML processing if they did not have to write the plumbing (orchestration, exception handling) themselves

“Gee, this problem is complex. Maybe I’ll do it in multiple stages! Gee, now I have to orchestrate the stages somehow. Batch files/shell scripts/driver program – all ugly and error prone. Maybe I’ll just write a single program after all…”

  • “Professional developers spend 50 percent of their time writing plumbing” – Adam Bosworth
  • I disagree. It is at least 60%.
  • XPipe aims to look after the plumbing letting developers concentrate on the interesting stuff

Major Functional Elements – XComponents

  • Developed in any language that runs on the Java Virtual Machine (Jython, Java, XSLT, Rhino (JavaScript) etc.)
  • All XComponents are standalone programs of the form

[Name] [InputXML] [OutputXML] [ErrorXML]

  • XComponents are described in XML form. An XComponent consists of:
    • Documentation
    • Unit Tests (input,output XML stream pairs)
    • Metadata for retrieval
    • Input and Output predicates – declarative (DTD/RelaxNG/Schema) or procedural (code)

Major Functional Elements – XComponent Unit Tester

  • Standalone program analogous to JUnit or PyUnit but for XML transformation component testing
  • Very outsource-friendly and “inbetweenable” approach (specify everything but the code == spec+doc+test harness all in one)
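
To illustrate testing with input/output XML pairs, here is a small sketch in Python. It is not the XPipe unit tester: run_unit_tests and strip_attributes are hypothetical names, and comparing canonical forms (xml.etree.ElementTree.canonicalize, Python 3.8+) is one possible way to ignore differences such as attribute order or empty-element syntax.

```python
import xml.etree.ElementTree as ET


def run_unit_tests(component, cases):
    """Return (input, expected, actual) for every case that fails."""
    failures = []
    for given, expected in cases:
        actual = component(given)
        # Compare canonical XML so serialization details do not matter.
        if ET.canonicalize(actual) != ET.canonicalize(expected):
            failures.append((given, expected, actual))
    return failures


def strip_attributes(xml_text):
    """Hypothetical component under test: remove every attribute."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        element.attrib.clear()
    return ET.tostring(root, encoding="unicode")


cases = [
    ('<a x="1"><b y="2"/></a>', "<a><b/></a>"),
    ("<a>text</a>", "<a>text</a>"),
]
failures = run_unit_tests(strip_attributes, cases)
print(len(failures))  # 0
```

The list of pairs is the specification: it can be written before the component exists and handed to whoever implements it.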

Major Functional Elements – XPipes

  • Described in XML
  • They consist of
    • Documentation
    • Input/Output Predicates (Schemas/Code)
    • Test Suite
    • References to XComponents which are resolved when the XPipe is compiled

Major Functional Elements – XPipe Executive

  • Uniprocessor: XPipe executed on 1 machine, possibly with separate threads for each XComponent task
  • Multiprocessor: XML based protocol to implement “Job Shop” work distribution over a P2P network
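
The uniprocessor case with one thread per XComponent could be sketched as below, in Python. This is an illustration under assumptions, not the XPipe executive: stages are plain functions on strings, queues connect neighbouring stages, and a failure in one stage is passed down the line as an exception object rather than stopping the pipe.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the document stream


def _worker(stage, inbox, outbox):
    """Apply one stage to every document arriving on inbox."""
    while True:
        doc = inbox.get()
        if doc is _DONE:
            outbox.put(_DONE)
            return
        if isinstance(doc, Exception):
            outbox.put(doc)  # an earlier stage failed; pass the error along
            continue
        try:
            outbox.put(stage(doc))
        except Exception as exc:
            outbox.put(exc)


def run_pipe(stages, documents):
    """Run each stage in its own thread; results keep the input order."""
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    threads = [
        threading.Thread(target=_worker, args=(stage, queues[i], queues[i + 1]))
        for i, stage in enumerate(stages)
    ]
    for thread in threads:
        thread.start()
    for doc in documents:
        queues[0].put(doc)
    queues[0].put(_DONE)
    results = []
    while True:
        out = queues[-1].get()
        if out is _DONE:
            break
        results.append(out)
    for thread in threads:
        thread.join()
    return results
```

While one document is in the second stage, the next can already be in the first, which is the assembly line principle applied to threads.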

Major Functional Elements – XPipe Monitor

  • Analogous to monitoring systems for fluid flow systems.
  • SCADA based systems have a lot of potential here

Some related open technologies

  • | - Unix Pipes
  • SAX Filters
  • TRAX
  • XBeans
  • Cocoon
  • AxKit
  • JXTA
  • Translets
  • TupleSpaces

Simple XComponent examples

  • Fundamental Operation – Rename Element
  • Rename
    Input : <foo>baz</foo>
    Output: <bar>baz</bar>
Rename of foo element to bar element
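
A minimal sketch of the Rename operation above, using Python's standard xml.etree.ElementTree. The function name rename_element is chosen for this example and is not taken from the XPipe component library.

```python
import xml.etree.ElementTree as ET


def rename_element(xml_text, old, new):
    """Rename every <old> element to <new>, leaving content untouched."""
    root = ET.fromstring(xml_text)
    for element in root.iter(old):
        element.tag = new
    return ET.tostring(root, encoding="unicode")


print(rename_element("<foo>baz</foo>", "foo", "bar"))  # <bar>baz</bar>
```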


  • Fundamental Operation - Peel
    Input : <foo><bar>baz</bar></foo>
    Output: <foo>baz</foo>
Peeling a bar element
  • Compound Operation - Matryoshka
    Unravelling elements like Russian Dolls

  • KlingonCloak

<tag name="foo"><tag name="bar">baz</tag></tag>

Making elements invisible but retaining the element type names
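
The Peel and KlingonCloak operations could be sketched as follows in Python. Both functions are illustrations with hypothetical names. The slide does not show the KlingonCloak input, so <foo><bar>baz</bar></foo> is assumed here; the sketch also overwrites any existing "name" attribute and cannot peel the root element, since it has no parent to receive its content.

```python
import xml.etree.ElementTree as ET


def _append_text(parent, index, text):
    """Attach text to whatever precedes position index inside parent."""
    if not text:
        return
    if index == 0:
        parent.text = (parent.text or "") + text
    else:
        previous = parent[index - 1]
        previous.tail = (previous.tail or "") + text


def peel(xml_text, tag):
    """Remove <tag> elements but keep their content in the parent."""
    root = ET.fromstring(xml_text)
    # Reverse document order visits descendants first, so nested matches
    # are already peeled by the time their own parent is processed.
    for parent in reversed(list(root.iter())):
        for child in [c for c in parent if c.tag == tag]:
            index = list(parent).index(child)
            parent.remove(child)
            _append_text(parent, index, child.text)
            for grandchild in child:
                parent.insert(index, grandchild)
                index += 1
            _append_text(parent, index, child.tail)
    return ET.tostring(root, encoding="unicode")


def klingon_cloak(xml_text):
    """Turn every element into <tag name="..."> carrying its old type name."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        element.set("name", element.tag)
        element.tag = "tag"
    return ET.tostring(root, encoding="unicode")


print(peel("<foo><bar>baz</bar></foo>", "bar"))  # <foo>baz</foo>
print(klingon_cloak("<foo><bar>baz</bar></foo>"))
# <tag name="foo"><tag name="bar">baz</tag></tag>
```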


  • Once you start thinking in terms of Pipes – components appear everywhere:
    • Regular fragmentations
    • Doctype changer
    • Namespace normalizer
    • Character set transcoder
    • Hash generator
    • RelaxNG/Schematron etc
      A validator can be thought of as a component in an XPipe that mirrors its input on its output
Validation as an XComponent
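
As a sketch of validation as a pass-through stage, the Python below checks a document and then returns it untouched. The root-element check is only a stand-in for a real RelaxNG or Schematron validator, and the names ValidationError and require_root are assumptions made for this example.

```python
import xml.etree.ElementTree as ET


class ValidationError(Exception):
    """Raised when a document fails the check (hypothetical error type)."""


def require_root(expected_tag):
    """Build a validating stage that mirrors its input on its output."""
    def stage(xml_text):
        root = ET.fromstring(xml_text)  # raises ParseError if not well formed
        if root.tag != expected_tag:
            raise ValidationError(
                "expected <%s>, found <%s>" % (expected_tag, root.tag)
            )
        return xml_text  # the output is the input, byte for byte
    return stage


check = require_root("foo")
print(check("<foo>baz</foo>"))  # <foo>baz</foo>
```

Because the stage changes nothing, it can be dropped between any two components whose output and input specs should agree.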

The XGrid

  • Grid Technologies – computational power “on tap” (http://www.gridforum.org)
  • The XGrid – computational power “on tap” to execute XPipes
XGrid - massively parallel XML processing with grid technology


Some objections (with some answers)

  • It will be slow
    • No it won’t - Premature optimization is the root of all evil!
    • Speed is a three headed monster.
Speed is a three headed monster

I’m old enough to have left the X axis and currently heading for Y through Z

    • Besides, massive parallelism will kill all von Neumann throughput arguments
    • "Documents per second" is the important metric - not seconds per document
    • A myriad of “compile time” optimizations on XPipes possible
    • Keep the architecture simple – and speed will sort itself out
  • Pipes are not rich enough, real data flows require graphs
    • Inside every graph is a collection of straight segments
    • Do the smallest thing that can possibly work
    • XComponents can conditionally flow data in different directions – graph
  • Component based software? Harumph! We have heard that one before…
    • XPipe is data flow based not API based (COM, VBX, CORBA).
    • The payload is what is important – not the plumbing
    • Information integration (needed on the server side) – not application integration (needed on the client side)
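
On the graph objection: one way a component might send data in different directions is a routing stage that picks the next straight segment from the document itself. The Python sketch below assumes string-in, string-out stages; route, root_is and mark are hypothetical helpers, and the element and attribute names are invented for the example.

```python
import xml.etree.ElementTree as ET


def root_is(tag):
    """Predicate: does the document's root element have this type name?"""
    def predicate(xml_text):
        return ET.fromstring(xml_text).tag == tag
    return predicate


def route(predicate, if_true, if_false):
    """Build a stage that forwards each document down one of two branches."""
    def stage(xml_text):
        branch = if_true if predicate(xml_text) else if_false
        return branch(xml_text)
    return stage


def mark(value):
    """Hypothetical branch: stamp the root with a 'route' attribute."""
    def stage(xml_text):
        root = ET.fromstring(xml_text)
        root.set("route", value)
        return ET.tostring(root, encoding="unicode")
    return stage


router = route(root_is("invoice"), mark("billing"), mark("archive"))
```

Each branch is itself an ordinary pipe, so the graph stays a collection of straight segments joined by small decision points.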

Current Status

  • Schemas for XPipes and XComponents on xpipe.sourceforge.net
  • Sample components (Java/XSLT/Jython) and some documentation
  • Simple, illustrative XPipe uniprocessor executives
  • Draft of XJCL – XGrid Job Control Language
  • Uniprocessor XPipe used to develop
    • 80-C pipe from Hub notation for a complex document type to a legacy mainframe display notation. 120 page spe
    • 20-C pipe for semantic validation of legislation documents
  • XPipe and XComponent validators

Current Problems

  • Everybody agrees that an XML document is a tree but:
  • The content and structure of the tree depends on the parser
  • The content and structure of re-generated XML (the round-tripping problem)
  • Naming things
    • Taxonomy of XTLs (XML Transformation Languages)
    • Taxonomy of re-usable XComponents and XPipes
  • Flexible transformation scheduling is hard
    Optimal transformation scheduling is very hard
  • Packaging

Future Plans

  • Evangelize the idea that DTD validated XML 1.0 is just Well Formed XML that has been through a pipe consisting of:
    • A transclusion component (entity expansion)
    • A macro pre-processor (conditional marked sections)
    • An attribute decorator (implied/fixed attributes)
    • A grammar checker
    • Valid XML
Valid XML as a pipeline transformation of well-formed XML


  • XPipes and XComponents as web services (SOAP/XML-RPC, UDDI etc.)
  • Getting the P2P and Grid Technology communities’ input into XGrid.
  • Getting help to develop the XPipe reference implementation on Sourceforge
  • Development of commercial implementations of XPipe integrated with leading EAI systems (Ongoing by Propylon)
  • Use of SCADA tools to develop XPipe process control and monitoring systems
  • Use of Animation Engineering techniques for CAXTE tools (Computer Aided XML Transformation Engineering)
  • Digging around hierarchy theory, self-assembly, bio-informatics and nanofabrication for concepts and tools applicable to XML transformations

In conclusion

  • XPipe is simple
  • Simplicity works!
  • Plenty of evidence outside of XML engineering that this approach will work
  • Plenty of lore and tools from other fields of science can be brought to bear to build systems using the XPipe approach

Thank you

Sean McGrath