Tuesday, June 11, 2013

Reading XML with Binding Classes in Talend Open Studio


When reading XML in Talend Open Studio, it's best to use the components shipped with the software: tFileInputXML and tFileInputMSXML.  However, for more control over the XML processing, use Java classes generated by a data binding tool like Liquid Technologies' XML Data Binder, XML Beans, or Castor.

Using the standard XML components for inputting XML means that your job can be supported by the Talend Open Studio community.  (See this blog post for an example.)  Yet when the XML is complex or can be written in many variations, adding Java code can make your job more robust.  This Java code is best paired with a data binding tool, which generates Java classes from a sample XML document or an XSD.  The data binding tools named above are the ones that I've used.

The XML

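The XML I'm working with in this post is based on the Recordare MusicXML standard for transmitting a musical score.  This is a section of the document.

[XML excerpt: a part-list header containing a score-part whose part-name is "Music", followed by a part section containing measures]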

 The score is divided into parts which are listed in both the header (part-list) and in individual sections (part).  A part is made up of measures.  The job in this post will handle single-part scores and will output the part-name along with a list of the measures.

Custom Java Code

In addition to the Java classes generated by the XML binding tool, I've written a class that follows the Adapter design pattern.  An Adapter class converts the interface of a class into one that is more convenient for the caller.  In this case, the Adapter class digs into the generated Java classes and extracts nested members and collections.  There is also null protection for elements that the XSD does not require.

The first method extracts a part-name.  If you're used to XPath, the general flow of code like this is to make one method call for each path element: part-list/score-part/part-name becomes getPart_list().getScore_part().getPart_name().  The optional score_part is checked for null, which prevents a NullPointerException when I access its part-name member.

public String getPartName() {
  String retVal = null;
  Part_list part_list = score_p.getPart_list(); // mandatory
  Score_part score_part = part_list.getScore_part(); // optional
  if( score_part != null ) {
    Part_name part_name = score_part.getPart_name(); // mandatory
    retVal = part_name.getPrimitiveValue();
  }
  return retVal;
}
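
The same null protection can also be written with java.util.Optional (Java 8 and later).  The following is only a sketch: PartList, ScorePart, and PartName are hypothetical stand-ins for the generated binding classes, kept small so the example runs without the binding library.

```java
import java.util.Optional;

class PartNameLookup {
    // Hypothetical stand-ins for the generated binding classes.
    static class PartName {
        private final String value;
        PartName(String value) { this.value = value; }
        String getPrimitiveValue() { return value; }
    }

    static class ScorePart {
        private final PartName partName;
        ScorePart(PartName partName) { this.partName = partName; }
        PartName getPart_name() { return partName; }
    }

    static class PartList {
        private final ScorePart scorePart; // optional, may be null
        PartList(ScorePart scorePart) { this.scorePart = scorePart; }
        ScorePart getScore_part() { return scorePart; }
    }

    // Walks part-list/score-part/part-name; returns null if a step is missing.
    static String partName(PartList partList) {
        return Optional.ofNullable(partList.getScore_part())
                .map(ScorePart::getPart_name)
                .map(PartName::getPrimitiveValue)
                .orElse(null);
    }

    public static void main(String[] args) {
        System.out.println(partName(new PartList(new ScorePart(new PartName("Music")))));
        System.out.println(partName(new PartList(null)));
    }
}
```

Each map() step is skipped once a null is encountered, so the chain reads like the XPath while still guarding every optional element.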


The second method extracts a type-safe list of measure numbers.

@SuppressWarnings("unchecked")
public List<String> getMeasureNumbers() {
   // element type assumed to be String (a MusicXML measure number is a token)
   List<String> retVal = new ArrayList<String>();
   PartCol parts = score_p.getPart(); // mandatory
   Iterator<Part> iterator = parts.getIterator();
   while( iterator.hasNext() ) {
     Part p = iterator.next(); // mandatory
     MeasureACol measures = p.getMeasure();
     Iterator<MeasureA> iterator2 = measures.getIterator();
     while( iterator2.hasNext() ) {
       MeasureA m = iterator2.next();
       try {
         retVal.add( m.getNumber() );
       } catch(LtException ignore) {
         // skip a measure whose number cannot be read
       }
     }
   }
   return retVal;
}


To create an adapter object, call the empty constructor, then initialize it with a string of XML.

public ScorePartwiseAdapter() {}


public void init(String xml) throws LtException, IOException {
  this.xml = xml;
  score_p = new Score_partwise();
  score_p.fromXml(xml);
}
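
To show the construct-then-init sequence end to end, here is a runnable sketch.  DomScoreAdapter is a hypothetical stand-in that mirrors the adapter's methods but parses with the JDK's DOM parser instead of the generated binding classes, and the SAMPLE document is made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class AdapterUsageSketch {
    // Hypothetical stand-in: same method names as the adapter, DOM underneath.
    static class DomScoreAdapter {
        private Document doc;

        DomScoreAdapter() {}

        void init(String xml) {
            try {
                doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                        .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            } catch (Exception e) {
                throw new IllegalArgumentException("could not parse XML", e);
            }
        }

        String getPartName() {
            NodeList names = doc.getElementsByTagName("part-name");
            return names.getLength() > 0 ? names.item(0).getTextContent() : null;
        }

        List<String> getMeasureNumbers() {
            List<String> numbers = new ArrayList<>();
            NodeList measures = doc.getElementsByTagName("measure");
            for (int i = 0; i < measures.getLength(); i++) {
                numbers.add(((Element) measures.item(i)).getAttribute("number"));
            }
            return numbers;
        }
    }

    // Made-up single-part score, trimmed to the elements this post uses.
    static final String SAMPLE =
            "<score-partwise><part-list><score-part id=\"P1\">"
            + "<part-name>Music</part-name></score-part></part-list>"
            + "<part id=\"P1\"><measure number=\"1\"/><measure number=\"2\"/></part>"
            + "</score-partwise>";

    public static void main(String[] args) {
        DomScoreAdapter adapter = new DomScoreAdapter(); // empty constructor
        adapter.init(SAMPLE);                            // then initialize with XML
        System.out.println(adapter.getPartName());
        System.out.println(adapter.getMeasureNumbers());
    }
}
```

Separating construction from init() lets a single adapter object be created once and then re-initialized for each XML document that arrives.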

The full code listing is here.

Job Design

After loading the libraries and putting a 'ScorePartwiseAdapter' object on the globalMap, the job reads a Clob (Memo) of XML from an Access table.  The ScorePartwiseAdapter is initialized with this XML text, and the convenience methods getPartName() and getMeasureNumbers() are called.  The result is a flow that fills a two-element schema: partName and measureNumbers.  measureNumbers is a Java List.

To process the Java List 'measureNumbers', a tLoop iterates over the elements.  The tLoop is preceded by a tFlowToIterate and followed by a tIterateToFlow.  I find it easiest to deal with flows in Talend Open Studio, but the iteration is needed because the Java List is not a flow.

Job Reading XML
 The ScorePartwiseAdapter object is created in tJava_1 using the default constructor (new ScorePartwiseAdapter()).  The object is then put on the globalMap using the key "score_a".

Calling ScorePartwiseAdapter Methods
 tJavaRow_1 fills a two-element schema with a partName String and a Java List of measureNumbers.
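
The code typed into tJava_1 and tJavaRow_1 might look roughly like the two method bodies below.  This is a sketch under assumptions: a HashMap stands in for Talend's globalMap, InputRow and OutputRow stand in for the generated row classes behind input_row and output_row, the input column name xml is hypothetical, and the adapter is a stub that returns canned values so the example runs on its own.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TJavaRowSketch {
    // Stub adapter with canned results; the real class wraps the binding classes.
    static class ScorePartwiseAdapter {
        private String xml;
        void init(String xml) { this.xml = xml; }
        String getPartName() { return "Music"; }
        List<String> getMeasureNumbers() { return Arrays.asList("1", "2"); }
    }

    // Stand-ins for the row structures Talend generates from the schemas.
    static class InputRow { String xml; }
    static class OutputRow { String partName; List<String> measureNumbers; }

    // Stand-in for the job-wide globalMap that Talend provides.
    static final Map<String, Object> globalMap = new HashMap<>();

    // Body of tJava_1: create the adapter once and share it by key.
    static void tJava_1() {
        globalMap.put("score_a", new ScorePartwiseAdapter());
    }

    // Body of tJavaRow_1: runs once per incoming row.
    static void tJavaRow_1(InputRow input_row, OutputRow output_row) {
        ScorePartwiseAdapter adapter = (ScorePartwiseAdapter) globalMap.get("score_a");
        adapter.init(input_row.xml);
        output_row.partName = adapter.getPartName();
        output_row.measureNumbers = adapter.getMeasureNumbers();
    }

    public static void main(String[] args) {
        tJava_1();
        InputRow in = new InputRow();
        in.xml = "<score-partwise/>"; // ignored by the stub adapter
        OutputRow out = new OutputRow();
        tJavaRow_1(in, out);
        System.out.println(out.partName + " " + out.measureNumbers);
    }
}
```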

Output

The result of this test job is a log message to System.out.  However, any flow-based output component can be used.  To do this, a loop is executed for each XML document.  This loop is based on the measureNumbers list and tLoop_1 is configured as follows.

Iterating Over Java Collection

 tLoop_1 gets partName and measureNumbers from a tFlowToIterate component, which converts the two fields of the tJavaRow_1 schema into a pair of global variables.  tLoop_1 then feeds into a tIterateToFlow component so that any flow-based component can be used downstream.
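
One way to picture the tFlowToIterate, tLoop_1, and tIterateToFlow chain is the plain Java simulation below.  It assumes tLoop_1 is a For loop running from 0 to the list size minus one, that tFlowToIterate stores its values under keys of the form row name dot column name (row2 is a hypothetical row name), and that the loop counter is read from the tLoop_1_CURRENT_VALUE key.  The pipe-separated output string is only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TLoopSketch {
    // Stand-in for Talend's globalMap.
    static final Map<String, Object> globalMap = new HashMap<>();

    static List<String> run(String partName, List<String> measureNumbers) {
        // tFlowToIterate: each schema field becomes a global variable.
        globalMap.put("row2.partName", partName);
        globalMap.put("row2.measureNumbers", measureNumbers);

        List<String> rows = new ArrayList<>();

        // tLoop_1 as a For loop: From 0, To size - 1, Step 1.
        int to = ((List<?>) globalMap.get("row2.measureNumbers")).size() - 1;
        for (int i = 0; i <= to; i++) {
            globalMap.put("tLoop_1_CURRENT_VALUE", i);

            // tIterateToFlow: each output column is an expression over globalMap.
            String name = (String) globalMap.get("row2.partName");
            Object number = ((List<?>) globalMap.get("row2.measureNumbers"))
                    .get((Integer) globalMap.get("tLoop_1_CURRENT_VALUE"));
            rows.add(name + "|" + number);
        }
        return rows;
    }

    public static void main(String[] args) {
        for (String row : run("Music", Arrays.asList("1", "2"))) {
            System.out.println(row);
        }
    }
}
```

The part name repeats on every row while the measure number changes, which is the shape a flow-based output component expects.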

tIterateToFlow Component
Input

The input in this job can be any flow-based input.  This example uses a Memo (Clob) field in Access.

Writing jobs using standard components gives your jobs the widest possible support across the Talend community.  This applies to working with XML.  However, there are cases where the off-the-shelf XML processing components aren't sufficient, such as when the XML is complex, varied, or tied up with extract or load processing.  In these cases, an XML data binding tool, commercial or open source, together with the established Adapter design pattern can make creating the TOS job easier.
