Towards the industrialization of XML processing
Last updated Sunday, December 09, 2001
Q. Where did XPipe come from?
A. XPipe had its genesis in a C++ toolkit for SGML processing, developed by Sean McGrath in 1993/1994.
Struggling to wrap his head around a complex SGML electronic publishing task, Sean concluded he was not smart enough to write one program to do the processing. He was, however, smart enough to split the problem into pieces which could be chained together to solve the overall problem.
In the intervening years, XML has happened, Java has happened, Web services are happening... None of these change the fundamental fact that splitting a problem into pieces is a good idea. XPipe is the name given to the XML transformation philosophy that has grown out of that simple idea.
Q. There are other freely available technologies that use an assembly line approach to XML processing out there. How does XPipe compare to them?
A. A quick look at some of the main ones - please send info/corrections and I'll add them here.
- Unix Pipes - The Grand Daddy of them all! Ken Thompson's pipe concept, added to Unix in 1972, got the ball rolling for pipeline processing of text. It worked (and continues to work) very well for line-oriented textual information. However, the beautiful simplicity of standard in, standard out, standard error + bounded buffers is not well suited to more complex, hierarchical information flows - such as one finds with XML systems. I think of basic XPipe as being an attempt to take what was great about the original Unix pipe idea and apply it to structured information streams based on XML. (All the stuff to do with XRigs, XGrid, Web Services etc. has no analog in the Unix Pipe world, but some of it does have analogies with JXTA.)
- Xbeans - XBeans is definitely on the same page as XPipe. It uses the Java bean abstraction and the DOM APIs.
- SAX Filters - Again similar in concept to basic uniprocessor XPipe, but specific to SAX event streams.
- TrAX - An API for creating XML transformations. An XPipe designed for execution under a single data flow can be considered a declarative syntax for the transformations expressed imperatively with TrAX API calls.
- Cocoon - Demand-driven, SAX-based framework for XML processing with an emphasis on Web publishing apps.
- AxKit - Demand-driven, Perl-based framework for XML processing with an emphasis on Web publishing apps.
- Transmorpher - Cocoonish, XSLT-based transformation with some interesting flow-based primitives.
- gnu.jaxp.pipeline - A variation on SAX filters based on wrapping and chaining SAX handlers (ContentHandler, LexicalHandler, etc.)
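To make the SAX filter style of chaining concrete, here is a minimal sketch using the filter base class from Python's standard library. It is not XPipe code: the two stages (UppercaseNames, DropElement), the Collector sink and the run_pipeline helper are all hypothetical names invented for illustration. Each stage sees the event stream produced by the stage before it, much as each program in a Unix pipeline sees the output of its predecessor.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler
from xml.sax.saxutils import XMLFilterBase


class UppercaseNames(XMLFilterBase):
    """Hypothetical stage 1: upper-case every element name."""

    def startElement(self, name, attrs):
        super().startElement(name.upper(), attrs)

    def endElement(self, name):
        super().endElement(name.upper())


class DropElement(XMLFilterBase):
    """Hypothetical stage 2: drop one element and everything inside it."""

    def __init__(self, parent, unwanted):
        super().__init__(parent)
        self._unwanted = unwanted
        self._depth = 0  # greater than 0 while inside an unwanted element

    def startElement(self, name, attrs):
        if self._depth or name == self._unwanted:
            self._depth += 1
        else:
            super().startElement(name, attrs)

    def endElement(self, name):
        if self._depth:
            self._depth -= 1
        else:
            super().endElement(name)

    def characters(self, content):
        if not self._depth:
            super().characters(content)


class Collector(ContentHandler):
    """Hypothetical sink: record the events that survive the pipeline."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        if content.strip():
            self.events.append(("text", content))


def run_pipeline(xml_bytes):
    """Chain parser -> UppercaseNames -> DropElement -> Collector."""
    stage1 = UppercaseNames(xml.sax.make_parser())
    stage2 = DropElement(stage1, "DRAFT")  # sees names already upper-cased
    sink = Collector()
    stage2.setContentHandler(sink)
    stage2.parse(io.BytesIO(xml_bytes))
    return sink.events
```

Note that the second stage has to match "DRAFT" rather than "draft", because it only ever sees what the first stage emits. That ordering dependency is exactly what a declarative pipeline description would make explicit.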
Q. Is there any documentation?
Yes, but not enough. There is never enough...
Q. How can I get involved?
Q. Is there a mailing list?
Q. Why base it on the Java Virtual Machine?
It seems like a reasonable lowest common denominator to aim at. With the advent of .NET, I expect to see a VM turf war break out and I want to make XPipe as VM independent as possible. This is one of the reasons I like higher level languages (such as Python) for XML processing, as they are more likely to port easily to other VMs.
Q. Why are there no DOCTYPE declarations in the XML files for XComponents, XPipes etc?
The short answer is that I (Sean McGrath) believe this is the right thing to do: a DOCTYPE opens a round-tripping can of worms for parsing if it happens to include an internal subset.
Longer answers can be found on xml-dev e.g.
Q. Why are there no XML declarations in the XML files for XComponents, XPipes etc?
The short answer is that I (Sean McGrath) believe that we should all just use UTF-8 and be done with it. The problems caused by supporting multiple encodings of Unicode just aren't worth it. Some examples:
Let's say I want to validate an instance against a DTD. To do so, I need to prepend a doctype declaration to the instance before passing the data stream into my validating XML parser. If the instance has an XML declaration, it must be the first thing in the instance. Therefore I need to detect the presence of an XML declaration, find the end of it and splice my doctype declaration into the instance after it. But getting this right in the face of all possible encodings is complicated. It is basically the first part of a full-blown XML parse. No thank you.
Let's say I decide to handle multiple encodings. My parser hands me UTF-8 regardless of the input encoding. Converting back to the original encoding is a pain. Not only that, but I need to detect what the original encoding was. Not trivial. No thank you.
Simply put, all XPipe XML is UTF-8. Period. This maximises simplicity in what is a thoroughly complex area. There be dragons. Let's not go there.
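To show how much the UTF-8, no-declaration convention simplifies the validation case, here is a minimal sketch of prepending a doctype declaration. The function name prepend_doctype and the system identifier "xpipe.dtd" used in the usage note are hypothetical, and the guard is a cheap sanity check at the start of the data, not a parser.

```python
def prepend_doctype(instance, root, system_id):
    """Return the instance bytes with a doctype declaration in front.

    Assumes the convention described above: the instance is UTF-8 and
    carries neither an XML declaration nor a DOCTYPE of its own.
    """
    text = instance.decode("utf-8")  # raises UnicodeDecodeError if not UTF-8
    head = text.lstrip()
    # Cheap guard only: it looks at the very start of the instance and
    # would miss, for example, a DOCTYPE that follows a leading comment.
    if head.startswith("<?xml ") or head.startswith("<!DOCTYPE"):
        raise ValueError("instance breaks the no-declaration convention")
    doctype = '<!DOCTYPE {} SYSTEM "{}">\n'.format(root, system_id)
    return (doctype + text).encode("utf-8")
```

For example, prepend_doctype(b"<XPipe/>", "XPipe", "xpipe.dtd") returns the original bytes with one extra line in front. No encoding sniffing and no splicing after an XML declaration are needed.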
Q. Is this FAQ complete?
Noooooooooo. Sorry. We update it as often as we can. The best way to get something into this FAQ is to send a question to the XPipe developers list, or even better send us a question and a corresponding answer.
Q. Why do exceptions raised by XPipe all take 0 as their first parameter?
The 0 is a placeholder for what will, prior to the 1.0 release, become a unique integer used to associate useful documentation with every error message produced by XPipe.
Q. Why restrict XComponents to one input XML file and one output XML file?
Firstly because it is the simplest, least obtrusive requirement to put on any developer. Most developers (especially those from a Unix background) have an affinity with this way of thinking about data processing.
Secondly because the XRig level of the XPipe architecture handles multiplicity of inputs/outputs, complex graph-like workflows etc. without complicating life at the XComponent level.
Developers of XComponents do not need to know anything about XPipe - never mind XRig. For all they know or care, they are developing standalone programs that follow the time-honoured template of input/output files supplied as command line parameters. This mechanism also makes it straightforward to automatically or semi-automatically convert existing XML processing programs into XComponents. It also greatly facilitates spreading work amongst a team of developers and/or sub-contracting development to external XComponent suppliers.
XComponents in this category include split, scatter, gather, triggers etc.
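The "standalone program" template described in this answer might look like the sketch below. It is an illustration only: the transformation itself (stamping a "processed" attribute on the root element), the script name component.py and the log line format are hypothetical. The only part taken from this FAQ is the argument order of input, output, logfile, [params].

```python
import sys
import xml.etree.ElementTree as ET


def transform(input_path, output_path, log_path, params):
    """Hypothetical transformation: mark the root element as processed."""
    with open(log_path, "a", encoding="utf-8") as log:
        tree = ET.parse(input_path)
        tree.getroot().set("processed", "yes")
        # UTF-8 out, with no XML declaration, in line with the
        # conventions described elsewhere in this FAQ.
        tree.write(output_path, encoding="utf-8", xml_declaration=False)
        log.write("{} -> {} params={}\n".format(input_path, output_path, params))


def main(argv):
    """Entry point: argv is [program, input, output, logfile, params...]."""
    if len(argv) < 4:
        sys.stderr.write("usage: component.py input output logfile [params...]\n")
        return 2
    transform(argv[1], argv[2], argv[3], argv[4:])
    return 0
```

A real script would finish by calling sys.exit(main(sys.argv)) under the usual __main__ guard. Nothing in the program refers to XPipe or XRig, which is the point: it can be written, tested and subcontracted as an ordinary command-line tool.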
Q. My code consists of a bunch of Java classes. What is the best way to make an XComponent from them?
The first thing to note is that the XComponent model does not set out to handle all the gory details of multi-class Java application installation. It is just too complicated and diverse a problem for any declarative syntax to handle.
So, for any multi-class Java application, create a "distribution" using whatever mechanism you like (Jar file, LiftOff install etc.).
In the <Doc> section of your XComponent, include instructions for locating and installing your app. (The XComponent file can be considered an "output" of your build process so you might create the XComponent instance as part of an Ant/Make build process.)
For the <Code> element you have a number of choices:
- Using <Code Type ="JavaJAR" ExtractFilename = "XXX.jar" Encoding="BASE64">, include a Base64 encoded JAR to access your app using the input,output,logfile,[params] signature shared by all XComponents.
- Using <Code Type ="JavaClass" ExtractFilename = "XXX.class" Encoding="BASE64">, include a Base64 encoded class file to access your app using the input,output,logfile,[params] signature shared by all XComponents.
- Using <Code Type ="JavaSource">, include a driver program to access your app using the input,output,logfile,[params] signature shared by all XComponents. This requires the host machine to have Java compilation and is probably best used during application development only. It is also better suited to "simple" transformations just using core JDK/JAXP facilities.
- Using <Code Type = "Exec">, include the command line necessary to invoke your app, e.g. "java -cp foo.jar net.xpipe.bar".
Note that in all cases you can specify a CLASSPATH attribute to add entries to the CLASSPATH prior to invocation.
See XComponents directory of XPipe distribution for examples.
Q. What is the relationship between Pyxie2 and http://www.pyxie.org ?
Pyxie2 represents a serious overhaul of the original Pyxie concepts. The changes were primarily in three areas:
- Jython support - Pyxie2 is equally happy in C Python and Java Python environments
- SAX is now used exclusively as an event source for tree building
- Pyxie2 has been modularised out into separate modules in a Python package
Pyxie2 will be maintained along with XPipe but will not be made dependent on XPipe so that it can be used independently.