Tuesday, June 11, 2013

Multiple XML Loops, Namespaces and schemaLocation with Talend Open Studio


If you have trouble working with Talend Open Studio's off-the-shelf XML components, tAdvancedFileOutputXML and tFileOutputMSXML, consider bringing in a third-party XML binding library.

When using Talend Open Studio's tFileOutputMSXML recently, I was unable to add namespaces to my XML document. As of September 2011, I counted about 29 unresolved issues related to this component on the bug trackers. So, I brought in a third-party XML binding library as a workaround.

This is an example of tFileOutputMSXML that doesn't require namespaces.

XML Binding

An XML binding library marshals and unmarshals XML to and from Java objects. Some libraries generate Java code from an XML Schema (XSD). XmlBeans, Castor, and Liquid XML Data Binder are XML binding libraries. XmlBeans and Castor work on the command line; there may be some Eclipse plug-ins available. Liquid Technologies provides a GUI.

When your programming language is Java, you can write a Java program that manipulates these XML binding objects, building up different parts of the tree-like XML structure in your algorithms.  You can loop over a result set and create objects, call a function that creates some objects, etc.  Then, at the end of the program, you can serialize the resulting XML to a String or other output source.

Builder Pattern

In most of my XML binding work, I've brought in a design pattern called the Builder Pattern. The Builder Pattern reduces the hassle of working with many objects by providing facade-like methods. For example, rather than

 Address a = new Address();
 a.setCity("Washington");
 contact.setAddress(a);

You might do something like

 builder.addContact("Joe", "Washington");
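
To make the comparison concrete, here is a minimal sketch of what such a builder might look like. The Address, Contact, and ContactListBuilder classes are hypothetical and exist only for this illustration; they are not taken from any generated binding code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical value classes, standing in for generated binding objects.
class Address {
    private String city;
    void setCity(String city) { this.city = city; }
    String getCity() { return city; }
}

class Contact {
    private String name;
    private Address address;
    void setName(String name) { this.name = name; }
    String getName() { return name; }
    void setAddress(Address address) { this.address = address; }
    Address getAddress() { return address; }
}

// Hypothetical builder: one call hides the creation and wiring of two objects.
class ContactListBuilder {
    private final List<Contact> contacts = new ArrayList<>();

    ContactListBuilder addContact(String name, String city) {
        Address a = new Address();
        a.setCity(city);
        Contact c = new Contact();
        c.setName(name);
        c.setAddress(a);
        contacts.add(c);
        return this; // returning the builder allows chained calls
    }

    // Returns a copy so callers cannot modify the builder's internal list.
    List<Contact> build() {
        return new ArrayList<>(contacts);
    }
}

class ContactBuilderDemo {
    public static void main(String[] args) {
        List<Contact> contacts = new ContactListBuilder()
                .addContact("Joe", "Washington")
                .build();
        Contact first = contacts.get(0);
        System.out.println(first.getName() + ", " + first.getAddress().getCity());
    }
}
```

The caller only ever sees addContact(); the object wiring stays in one place, which is the property that matters once the object tree gets deep.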

This screenshot shows an XSD 'cruise-ports.xsd'

cruise-ports.xsd
This UML diagram shows several classes generated from cruise-ports.xsd by Liquid XML Data Binder 2011.  The classes are matched with a Builder called CruisePortsBuilder which manipulates the generated objects.

Builder Class with Generated Classes
 CruisePorts, CruisePort, CruiseLine, CruiseLines, SnackShops, and SnackShop are Java classes defined by their respective XSD elements: cruise-ports, cruise-port, cruise-line, cruise-lines, snack-shops, snack-shop.  For more on how these classes were generated with Liquid XML Data Binder, follow this.

The implementation of the Builder class 'CruisePortsBuilder' consists of a method for each main loop: addCruise_line() and addSnack_shop(). There is also an init() method that clears out the document structure, sets namespaces, and sets top-level attributes.
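
For readers without the Liquid-generated classes at hand, the shape of such a builder can be sketched with the JDK's own DOM API. This is not the actual CruisePortsBuilder: the class name CruisePortsBuilderSketch, the method parameters, the element nesting, and the example namespace URI are all assumptions made for illustration.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical sketch: plain JDK DOM instead of the generated binding classes.
class CruisePortsBuilderSketch {
    private static final String XMLNS_NS = "http://www.w3.org/2000/xmlns/";
    private static final String XSI_NS = "http://www.w3.org/2001/XMLSchema-instance";

    private Document doc;
    private Element cruiseLines;
    private Element snackShops;
    private String prefix;
    private String namespaceUri;

    // Clears the document, then sets namespaces and top-level attributes.
    void init(String prefix, String namespaceUri, String schemaFile) {
        this.prefix = prefix;
        this.namespaceUri = namespaceUri;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            doc = factory.newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        Element root = element("cruise-port"); // assumed root element
        root.setAttributeNS(XMLNS_NS, "xmlns:" + prefix, namespaceUri);
        root.setAttributeNS(XMLNS_NS, "xmlns:xsi", XSI_NS);
        root.setAttributeNS(XSI_NS, "xsi:schemaLocation", namespaceUri + " " + schemaFile);
        doc.appendChild(root);
        cruiseLines = (Element) root.appendChild(element("cruise-lines"));
        snackShops = (Element) root.appendChild(element("snack-shops"));
    }

    // One method per loop; each call appends one repeating element.
    void addCruise_line(String name) {
        cruiseLines.appendChild(element("cruise-line")).setTextContent(name);
    }

    void addSnack_shop(String name) {
        snackShops.appendChild(element("snack-shop")).setTextContent(name);
    }

    // Serializes the whole tree to a String without an XML declaration.
    String toXmlString() {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    private Element element(String localName) {
        return doc.createElementNS(namespaceUri, prefix + ":" + localName);
    }
}

class CruisePortsBuilderSketchDemo {
    public static void main(String[] args) {
        CruisePortsBuilderSketch builder = new CruisePortsBuilderSketch();
        builder.init("cp", "urn:example:cruise-ports", "cruise-ports.xsd"); // placeholder values
        builder.addCruise_line("Line A");
        builder.addSnack_shop("Shop A");
        System.out.println(builder.toXmlString());
    }
}
```

Checked exceptions are wrapped in IllegalStateException so the methods can be called from short snippets without extra try/catch blocks.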

For the source of the Java builder class, follow CruisePortsBuilder.java.  Here is a Main.java that uses CruisePortsBuilder without Talend.

If you had many loops or more complicated logic, the builder would absorb the complexity rather than the Talend job.

Integration with Talend Open Studio

Talend Open Studio interacts with the Builder class in three stages.

  1. Initialize the Builder.  Load libraries, new(), put on globalMap
  2. For each data flow, call a Builder method with a row
  3. At the end, form an output XML string to be written to a file
TOS Job Interacting with Custom XML Builder
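
The three stages can be simulated outside Talend to show how the pieces hand off to one another. In the sketch below, a plain HashMap stands in for Talend's globalMap, lists of strings stand in for the data flows, and StandInBuilder is a hypothetical, string-based substitute for the real builder. The "builder" map key, the "cp" prefix, and the namespace URI are placeholders chosen for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the real builder; it assembles XML as text.
class StandInBuilder {
    private static final String NS = "urn:example:cruise-ports"; // placeholder URI
    private final StringBuilder lines = new StringBuilder();
    private final StringBuilder shops = new StringBuilder();
    private String prefix = "cp";

    // Must be called before any add method so all elements share the prefix.
    void setPrefix(String prefix) { this.prefix = prefix; }

    void addCruise_line(String name) { lines.append(element("cruise-line", escape(name))); }

    void addSnack_shop(String name) { shops.append(element("snack-shop", escape(name))); }

    String toXmlString() {
        String body = element("cruise-lines", lines.toString())
                + element("snack-shops", shops.toString());
        return "<" + prefix + ":cruise-port xmlns:" + prefix + "=\"" + NS + "\">"
                + body + "</" + prefix + ":cruise-port>";
    }

    private String element(String name, String content) {
        return "<" + prefix + ":" + name + ">" + content + "</" + prefix + ":" + name + ">";
    }

    // Minimal escaping for element text.
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}

class ThreeStageDemo {
    static String run(List<String> cruiseLineRows, List<String> snackShopRows) {
        Map<String, Object> globalMap = new HashMap<>(); // stands in for Talend's globalMap

        // Stage 1: create the builder, set the prefix, put it on the map (the tJava step).
        StandInBuilder created = new StandInBuilder();
        created.setPrefix("cp");
        globalMap.put("builder", created);

        // Stage 2: for each row of each flow, fetch the builder and call its method
        // (the tJavaRow step).
        for (String name : cruiseLineRows) {
            ((StandInBuilder) globalMap.get("builder")).addCruise_line(name);
        }
        for (String name : snackShopRows) {
            ((StandInBuilder) globalMap.get("builder")).addSnack_shop(name);
        }

        // Stage 3: form the output XML string (the final tJava step).
        return ((StandInBuilder) globalMap.get("builder")).toXmlString();
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("Line A", "Line B"), Arrays.asList("Shop & Go")));
    }
}
```

Because the map stores plain Objects, every lookup needs a cast back to the builder type, both in this simulation and in the job's Java components.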

Stage 1 - Initialize the Builder


To initialize the builder, two tLoadLibrary components are used. The first loads the CruisePortsBuilder class plus the generated Java classes: CruisePorts, CruisePort, etc. The second loads Liquid Technologies' proprietary libraries. The same two-part loading would apply to the other binding libraries mentioned in this post.

The tJava at the start of the job creates the builder object and puts it on the map after setting a namespace prefix.

Init
Stage 2 - Processing

CruisePortsBuilder contains a method for each loop in the XML document.  In this respect, it works like tFileOutputMSXML.  The job takes advantage of metadata defined as delimited files and uses standard Talend data flow connections to invoke the builder method.

Input Metadata
Each row of a text file is sent to a tJavaRow, where the builder is pulled from the globalMap and invoked. This is the invocation of the addCruise_line() method.
Processing a Row with a Builder Method
The addSnack_shop() method contains similar-looking code. Any additional loops would follow the same pattern, each with its own data flow.

Output

Output is also handled using the standard Talend connections. A tJava calls the builder method 'toXmlString()' and stores the result in a variable that is written to the output row.

Gathering Output

The output file is a single-field "delimited" file.
Single-field Target Schema
Here is the XML result of the job.

XML Result
 Note the top-level attributes, namespaces, and the aliases that were specified in the Talend Open Studio job (see tJava_1).

It's best to use the standard Talend components where possible.  The Talend Exchange can also provide some functionality.  However, if you need features not available off-the-shelf, then XML binding libraries provide a workaround. A future post will demonstrate work with the xs:group element as an example. 
