Tuesday, June 11, 2013

XPath Loops in Talend Open Studio


When analyzing an XML document for processing, one tends to think top-down.  For example, "in a library, give me all the books" implies a structure.  However, it may be easier to think from the innermost elements, outward: "give me all the books and their library".

Talend Open Studio uses the tFileInputXML component to read XML documents into a job.  tFileInputXML uses a Loop XPath query to define a repeating structure in the document, against which a series of mapping XPath queries is run.  There is one mapping XPath query for each schema data field to be set during processing.

Bottom-up Processing

When working with a hierarchical structure like a filesystem, one starts at the top and drills down to lower-level elements.  However, in XPath processing with Talend Open Studio, it's important to start with the lowest-level grain that will define a record.  For example, the following XML document has ID elements in an IDs element contained within a Locations element.


   
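   <Locations>
     <IDs>
       <ID sequenceName="...">ABCDE</ID>
       <ID sequenceName="..."></ID>
       <ID sequenceName="...">XYZ</ID>
     </IDs>
   </Locations>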
   


The first step in processing this XML document is to determine whether each ID is a record (in which case there will be three rows produced by tFileInputXML) or if the IDs element defines a record (only one row).

Starting with the lowest level possible, this Talend job produces three name / value pairs, one for each ID element.  The loop is set to Locations/IDs/ID.  The mapping @sequenceName returns the value of the sequenceName attribute.  The period (".") stands for the current element, which is the ID selected by the loop, and returns the text in that ID element.
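
The same loop-and-mapping idea can be sketched outside Talend.  This is a minimal illustration using Python's standard xml.etree.ElementTree, not what tFileInputXML runs internally; the sequenceName values ("seq1", "seq2", "seq3") and the output field names are placeholders assumed for the example.

```python
import xml.etree.ElementTree as ET

# Sample document. The sequenceName values are hypothetical placeholders.
XML = """
<Locations>
  <IDs>
    <ID sequenceName="seq1">ABCDE</ID>
    <ID sequenceName="seq2"></ID>
    <ID sequenceName="seq3">XYZ</ID>
  </IDs>
</Locations>
"""

root = ET.fromstring(XML.strip())  # root is the <Locations> element

# Loop: Locations/IDs/ID. Relative to the root element that is "IDs/ID".
rows = []
for id_elem in root.findall("IDs/ID"):
    rows.append({
        "name": id_elem.get("sequenceName"),  # mapping "@sequenceName"
        "value": id_elem.text or "",          # mapping "." (text of the current ID)
    })

for row in rows:
    print(row)
```

Each ID becomes its own row, so the loop yields three rows, including one with an empty value for the empty ID element.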

Each ID Defines a Record

An alternative way of processing the Locations document is to specify the loop element as Locations/IDs.  In this example, a single record will be produced, because the document contains only one IDs element.  Attribute selectors ([@sequenceName=""]) in the mapping queries map each ID element to a different field.
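
A rough Python equivalent of looping on the IDs element is sketched below, again with the standard xml.etree.ElementTree rather than Talend itself.  The sequenceName values and the field names in FIELD_MAPPINGS are assumptions made up for the illustration.

```python
import xml.etree.ElementTree as ET

# Sample document. The sequenceName values are hypothetical placeholders.
XML = """
<Locations>
  <IDs>
    <ID sequenceName="seq1">ABCDE</ID>
    <ID sequenceName="seq2"></ID>
    <ID sequenceName="seq3">XYZ</ID>
  </IDs>
</Locations>
"""

root = ET.fromstring(XML.strip())

# One output field per ID, each picked out by an attribute predicate.
# Field names are hypothetical.
FIELD_MAPPINGS = {
    "seq1_value": "ID[@sequenceName='seq1']",
    "seq2_value": "ID[@sequenceName='seq2']",
    "seq3_value": "ID[@sequenceName='seq3']",
}

# Loop: Locations/IDs. Relative to the root element that is "IDs".
records = []
for ids_elem in root.findall("IDs"):
    record = {}
    for field, xpath in FIELD_MAPPINGS.items():
        # findtext returns "" for an element with no text and the
        # default when nothing matches.
        record[field] = ids_elem.findtext(xpath, default="")
    records.append(record)

print(records)
```

Here the three ID values land in three columns of a single row instead of three separate rows.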

Containing IDs Element Defines a Record
In other cases, there may be extra information in the parent required by the child.  This extra information may provide identifying or contextual information.  Suppose "Locations" allowed additional "IDs" elements.  In order to associate an ID record with its IDs parent, provide a relative reference to the parent ("../@Name"); the parent's value is then repeated on each ID record.
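
The parent lookup can be sketched in Python as follows.  This is an illustration only: the second IDs element, the Name values ("groupA", "groupB"), the sequenceName values, and the "FGHIJ" text are all invented for the example.  ElementTree elements do not keep a pointer to their parent, so the sketch builds a child-to-parent lookup to stand in for "../@Name".

```python
import xml.etree.ElementTree as ET

# Hypothetical document with two IDs elements, each carrying a Name attribute.
XML = """
<Locations>
  <IDs Name="groupA">
    <ID sequenceName="seq1">ABCDE</ID>
    <ID sequenceName="seq2">XYZ</ID>
  </IDs>
  <IDs Name="groupB">
    <ID sequenceName="seq1">FGHIJ</ID>
  </IDs>
</Locations>
"""

root = ET.fromstring(XML.strip())

# Child -> parent lookup, needed because ElementTree cannot walk upward
# from an element on its own.
parent_of = {child: parent for parent in root.iter() for child in parent}

# Loop: Locations/IDs/ID, still the lowest-level grain.
rows = []
for id_elem in root.findall("IDs/ID"):
    rows.append({
        "group": parent_of[id_elem].get("Name"),  # stands in for "../@Name"
        "name": id_elem.get("sequenceName"),      # "@sequenceName"
        "value": id_elem.text or "",              # "."
    })

for row in rows:
    print(row)
```

The loop stays on ID, and the parent's Name is repeated on every row that belongs to it.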

It's natural to think top down when looking at a hierarchy.  However, for XML processors it may help to think bottom-up to identify the correct looping structure.  Parents -- and other ancestors -- aren't ignored in the bottom-up processing.  Access parent elements and attributes using relative (../) paths.

Namespaces Update

If your input XML uses namespaces and they can be ignored, then set the "Ignore namespaces" option on the tFileInputXML's Advanced settings tab.  This will produce a temp file of the XML data with all namespace definitions and prefixes stripped out.
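
To see why stripping namespaces simplifies the XPath queries, here is a small Python sketch of the general idea.  It is not Talend's implementation; the namespace URI, the "loc" prefix, the sequenceName value, and the strip_namespaces helper are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespaced version of the Locations document.
XML = """
<loc:Locations xmlns:loc="http://example.com/locations">
  <loc:IDs>
    <loc:ID sequenceName="seq1">ABCDE</loc:ID>
  </loc:IDs>
</loc:Locations>
"""

root = ET.fromstring(XML.strip())


def strip_namespaces(elem):
    """Remove the '{uri}' part from every element tag, in place.

    Namespaced attributes are left untouched in this sketch.
    """
    for node in elem.iter():
        if isinstance(node.tag, str) and node.tag.startswith("{"):
            node.tag = node.tag.split("}", 1)[1]
    return elem


# With namespaces in place, the plain path finds nothing.
before = root.findall("IDs/ID")

strip_namespaces(root)

# After stripping, the same unqualified path matches.
after = root.findall("IDs/ID")

print(len(before), len(after))
```

Without stripping, every step of the loop and mapping paths would have to be qualified with the namespace.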
