Tuesday, June 11, 2013

Handling xs:group in Talend Open Studio


If you work with XML, you may see the xs:group feature in an XSD.  To work with XML based on this XSD in Talend Open Studio, consider using an XML binding tool and a custom Java class that implements the Builder Pattern.

Background

xs:group lets XSD authors reuse a list of elements (a sequence, choice, or all).  xs:group, called a Model Group in the specification, is different from xs:complexType.  A complexType must be assigned to an element declaration before it appears in a document.  A group instead contributes the elements in its own definition, which are inserted into a complexType via a group reference (xs:group ref).

Complex Type

For example, consider a complexType 'PhoneNumberType' with elements 'number' and 'countryCode'.

Another type, 'ContactType', uses PhoneNumberType by declaring an element 'contactPhone' of that type.  Another type, say 'BusinessType', might declare its own element, 'businessPhone', of the same type.

XPaths to get the numbers' country codes would be contact/contactPhone/countryCode and business/businessPhone/countryCode.

Model Group

A Model Group is defined in a similar fashion to a Complex Type: a named xs:group wraps a list of element declarations.

The syntax for a referencing type is different.  A 'group ref' is added to the sequence rather than a new element.

This will generate XML accessed by XPaths like contact/countryCode or business/countryCode.  Unlike PhoneNumberType, a Group requires no intervening element to wrap its contents.  However, an author may still define a Group whose Sequence holds a single Element and give that Element a Complex Type.
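To make the difference in XPaths concrete, here is a minimal sketch that uses only the JDK's built-in validation and XPath APIs.  The schema is my own illustration, not the one from this post: the names 'PhoneGroup' and 'name' and the xs:string types are assumptions.  'contact' references the group, while 'business' wraps the same two elements in a 'businessPhone' element of type PhoneNumberType.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class GroupVersusTypeDemo {

    // Hypothetical schema: 'contact' pulls in PhoneGroup by reference,
    // 'business' uses an element typed as PhoneNumberType.
    public static final String XSD =
          "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:group name='PhoneGroup'><xs:sequence>"
        + "<xs:element name='number' type='xs:string'/>"
        + "<xs:element name='countryCode' type='xs:string'/>"
        + "</xs:sequence></xs:group>"
        + "<xs:complexType name='PhoneNumberType'><xs:sequence>"
        + "<xs:element name='number' type='xs:string'/>"
        + "<xs:element name='countryCode' type='xs:string'/>"
        + "</xs:sequence></xs:complexType>"
        + "<xs:element name='contact'><xs:complexType><xs:sequence>"
        + "<xs:element name='name' type='xs:string'/>"
        + "<xs:group ref='PhoneGroup'/>"
        + "</xs:sequence></xs:complexType></xs:element>"
        + "<xs:element name='business'><xs:complexType><xs:sequence>"
        + "<xs:element name='businessPhone' type='PhoneNumberType'/>"
        + "</xs:sequence></xs:complexType></xs:element>"
        + "</xs:schema>";

    // Throws an exception if the XML does not conform to the schema above.
    public static void validate(String xml) throws Exception {
        Validator validator = SchemaFactory
                .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(XSD)))
                .newValidator();
        validator.validate(new StreamSource(new StringReader(xml)));
    }

    // Evaluates an XPath expression against the XML and returns its string value.
    public static String xpath(String xml, String expr) throws Exception {
        return XPathFactory.newInstance().newXPath()
                .evaluate(expr, new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        // Group reference: number and countryCode sit directly under contact.
        String contact = "<contact><name>Ann</name><number>555-0100</number>"
                + "<countryCode>1</countryCode></contact>";
        // Complex type: the same elements are wrapped in businessPhone.
        String business = "<business><businessPhone><number>555-0199</number>"
                + "<countryCode>44</countryCode></businessPhone></business>";

        validate(contact);
        validate(business);
        System.out.println(xpath(contact, "contact/countryCode"));
        System.out.println(xpath(business, "business/businessPhone/countryCode"));
    }
}
```

The group leaves no trace in the instance document; only the elements it contains show up.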

An author might do this to use an element's name consistently.  A Group whose Sequence contains a single Element makes that element name available wherever the Group is referenced.  For example, a Group with a Sequence containing only the Element 'phone' would give every referencing type a 'phone' element, rather than the 'businessPhone', 'contactPhone', or 'phn' that each might declare with a Complex Type.

Talend Open Studio and Complex XML

When possible, use the Talend off-the-shelf XML components.  If you're outputting XML, try creating an XML file output based on the XSD.  Ideally, your job would look like the screenshot following this paragraph.  However, the Recordare MusicXML XSD I was working on caused numerous problems for Talend: empty XML elements, invalid XML, and unmapped data.  If your XSD requires more than one loop element or can't readily be processed by Talend, consider incorporating a third-party library and a custom Java class that implements the Builder pattern.

The Job I'd Like to Write


For a detailed explanation on XML binding and the Builder pattern, read this blog post.  The following is a second example of a custom Java class written for Talend Open Studio that does handle the XSD from the preceding screenshot.

Builder Class

The Builder Pattern constructs a complex object like an XML document in stages.  In this example, a stage is a loop in the XSD.  While the Recordare MusicXML XSD contains hundreds of elements that can express pagination and formatting, my simple data is based on Measures and Notes.  So, the Builder Class, ScorePartwiseBuilder, is correspondingly simple. Yet supporting ScorePartwiseBuilder is a complex set of over 400 generated classes.

To get a feel for this pattern, look at this main() method run outside of Talend Open Studio.

ScorePartwiseBuilder bld = new ScorePartwiseBuilder();

bld.init("A1", "Music");

bld.addMeasure("1", "0", "4", "4", "C", "2");
bld.addNote("C", "1", "1", "quarter");
bld.addNote("D", "1", "1", "quarter");
bld.addNote("E", "1", "1", "quarter");
bld.addNote("D", "1", "1", "quarter");

System.out.println( bld.toXmlString() );


After creation (new), an init() method sets some fields which will be represented as a block of XML.  A measure is added with addMeasure() and notes are added with addNote().  For each measure and sequence of notes, this pattern is repeated.  The Builder Class keeps track of the internal state -- the current measure -- and eventually will render the document to a String with toXmlString().
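To show the state tracking in isolation, here is a stripped-down sketch of such a builder.  It is not the real ScorePartwiseBuilder: 'MiniScoreBuilder' is a hypothetical class that concatenates strings instead of calling generated binding classes, addMeasure() takes only the measure number, and nothing is validated against the XSD.

```java
public class MiniScoreBuilder {

    private final StringBuilder xml = new StringBuilder();
    // Internal state: is a <measure> element currently open?
    private boolean measureOpen = false;

    // Writes the header block and opens the part.
    public MiniScoreBuilder init(String partId, String partName) {
        xml.append("<score-partwise><part-list><score-part id=\"").append(partId)
           .append("\"><part-name>").append(partName)
           .append("</part-name></score-part></part-list><part id=\"")
           .append(partId).append("\">");
        return this;
    }

    // Closes the previous measure, if any, and opens a new one.
    public MiniScoreBuilder addMeasure(String number) {
        if (measureOpen) {
            xml.append("</measure>");
        }
        xml.append("<measure number=\"").append(number).append("\">");
        measureOpen = true;
        return this;
    }

    // Adds a note to the current measure; values are not XML-escaped in this sketch.
    public MiniScoreBuilder addNote(String step, String octave, String duration, String type) {
        if (!measureOpen) {
            throw new IllegalStateException("addMeasure() must be called before addNote()");
        }
        xml.append("<note><pitch><step>").append(step)
           .append("</step><octave>").append(octave)
           .append("</octave></pitch><duration>").append(duration)
           .append("</duration><type>").append(type).append("</type></note>");
        return this;
    }

    // Renders the document without changing the builder's state.
    public String toXmlString() {
        return xml + (measureOpen ? "</measure>" : "") + "</part></score-partwise>";
    }

    public static void main(String[] args) {
        MiniScoreBuilder bld = new MiniScoreBuilder();
        bld.init("A1", "Music");
        bld.addMeasure("1");
        bld.addNote("C", "1", "1", "quarter");
        bld.addNote("D", "1", "1", "quarter");
        System.out.println(bld.toXmlString());
    }
}
```

The caller never closes a measure explicitly.  The builder does that when the next measure starts or when the document is rendered, which is what keeps the calling code short.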

Builder Class Referencing XML Binding Tool Classes

The full source of ScorePartwiseBuilder is available here.  This post uses Liquid XML Data Binder, though open source tools like XMLBeans work too.

Job Design

The sequence of steps in a Talend Open Studio job starts with loading libraries and creating the builder class.  Next, text file input is sent to the builder.  Finally, an output string is formed.

Job Using ScorePartwiseBuilder
 The tLibraryLoad components load the JAR for the builder and generated Java classes and also the JAR for the runtime, provided by Liquid XML Data Binder.  tLibraryLoad_1 also includes an import statement for ScorePartwiseBuilder.

tJava_1 instantiates the builder class, initializes the object with some header information, and stores the builder on the globalMap for use in later components.  tJava_1 also sets a current measure variable (curr_measure) that ensures that tJavaRow_1 will have a value.
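The hand-off through the globalMap can be sketched in plain Java.  In a Talend job, globalMap is a Map<String, Object> supplied by the generated code; here a HashMap stands in for it.  'DemoBuilder' is a hypothetical placeholder for the real builder, and the key names and the choice to keep curr_measure on the map are my assumptions.

```java
import java.util.HashMap;
import java.util.Map;

public class GlobalMapHandoff {

    // Hypothetical stand-in for the real builder class.
    public static class DemoBuilder {
        private String partId;

        public void init(String partId, String partName) {
            this.partId = partId;
        }

        public String getPartId() {
            return partId;
        }
    }

    // Roughly what the tJava step does: create, initialize, and store.
    public static void setup(Map<String, Object> globalMap) {
        DemoBuilder bld = new DemoBuilder();
        bld.init("A1", "Music");
        globalMap.put("builder", bld);
        // Start with an empty value so the first row always counts as a new measure.
        globalMap.put("curr_measure", "");
    }

    // Roughly what a later component does: fetch and cast.
    public static DemoBuilder lookup(Map<String, Object> globalMap) {
        return (DemoBuilder) globalMap.get("builder");
    }

    public static void main(String[] args) {
        Map<String, Object> globalMap = new HashMap<>();
        setup(globalMap);
        System.out.println(lookup(globalMap).getPartId());
    }
}
```

The cast is needed because the map stores plain Objects, so later components have to know the builder's type.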

Input

The input for this example is the following text file.  Note that the input is an extremely condensed data set compared to what's possible with the schema.

Measure,Fifths,Beats,BeatType,Sign,Line,Step,Octave,Duration,Line
1,0,4,4,C,2,C,1,1,quarter
1,0,4,4,C,2,D,1,1,quarter
1,0,4,4,C,2,E,1,1,quarter
1,0,4,4,C,2,D,1,1,quarter

 The code behind tJavaRow_1 will create the Measures and Notes.  Notes make up Measures, so a Measure doesn't need to be created for each Note.  This is coded using a flag 'curr_measure' which is adjusted for each new Measure.
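The measure-tracking logic looks roughly like the following sketch.  It runs outside Talend, loops over the rows itself, and records the builder calls as strings instead of invoking a real builder.  'MeasureTracker' and its method are hypothetical names, and the argument order is assumed to follow the columns of the input file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MeasureTracker {

    // Turns CSV data rows (no header) into the sequence of builder calls.
    public static List<String> commands(List<String> csvRows) {
        List<String> calls = new ArrayList<>();
        String currMeasure = ""; // no measure seen yet

        for (String row : csvRows) {
            String[] f = row.split(",");
            String measure = f[0];

            // Only start a new measure when the measure number changes.
            if (!measure.equals(currMeasure)) {
                String[] measureArgs = Arrays.copyOfRange(f, 0, 6);
                calls.add("addMeasure(" + String.join(", ", measureArgs) + ")");
                currMeasure = measure;
            }

            // Every row contributes one note to the current measure.
            String[] noteArgs = Arrays.copyOfRange(f, 6, 10);
            calls.add("addNote(" + String.join(", ", noteArgs) + ")");
        }
        return calls;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList(
                "1,0,4,4,C,2,C,1,1,quarter",
                "1,0,4,4,C,2,D,1,1,quarter",
                "1,0,4,4,C,2,E,1,1,quarter",
                "1,0,4,4,C,2,D,1,1,quarter");
        for (String call : commands(rows)) {
            System.out.println(call);
        }
    }
}
```

With the four input rows above, this yields one addMeasure call followed by four addNote calls.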

Invoking the Builder Commands
 The output is the following XML.  I made one adjustment to the document to remove an extra timestamp that I hope will be resolved by Liquid XML's support.

XML Output Enforced by XSD Group
It's best to work within the available components and routines of Talend Open Studio when constructing jobs.  However, when XML processing becomes too complex in terms of schema or algorithm, consider using this builder pattern to streamline things.  Granted, it's a Java-heavy solution, but if you have a Java resource available, this type of class is quite easy to create.
