Tuesday, June 11, 2013

Three Levels XML Nesting in Talend Open Studio


To output a complex XML document with multiple nested structures, first write a denormalized query then route the main flow to Talend Open Studio's tAdvancedFileOutput component.

Talend Open Studio's tAdvancedFileOutput component builds an XML tree structure and binds the input schema to the structure's XML elements and attributes.  For complex XML documents, tAdvancedFileOutput also defines loops and groups to form nested elements.

This post describes an example of an XML document with three-levels of nesting.

Denormalized Data Set

In the following E-R diagram, an ITEM may contain one or more ITEM_EANS.  ITEM_EANS may contain one or more SUPPLIERS.

E-R Showing ITEMS, EAN, and SUPPLIERS
The first step in producing an XML document in Talend Open Studio is to write a denormalizing query that will retrieve all of the data for the XML document.  A denormalizing query is one that will produce repeating groups in a result set.

This SQL is such a query.

SELECT
ITEM.ITEMID, ITEM.ITEMNAME,
ITEM_EAN.ITEM_EAN_ID,
SUPPLIER.SUPPLIERID, SUPPLIER.SUPPLIER_NAME
FROM (ITEM INNER JOIN ITEM_EAN ON ITEM.ITEMID = ITEM_EAN.ITEMID)
INNER JOIN SUPPLIER ON ITEM_EAN.ITEM_EAN_ID = SUPPLIER.ITEM_EAN_ID
ORDER BY ITEM.ITEMID, ITEM_EAN.ITEM_EAN_ID, SUPPLIER.SUPPLIERID;

And the result may look like what is captured in this screenshot.

A Denormalized Data Set
Loops and Groups

This screenshot shows a simple 2 component Talend Open Studio job that runs the query listed above against an Access database.  The main flow of the tAccessInput component is directed to a tAdvancedFileOutputXML component.

Simple XML-Producing Job

The configuration of the tAdvancedFileOutput begins with identifying the lowest level grain in the query's result set.  This may seem backwards, as XML modeling tends to start with the toplevel element and work downward.  In this example, the lowest level grain is the SUPPLIER which is "owned" by the ITEM_EAN which in turn is owned by the ITEM.

After the loop element is identified, any groups are flagged.  This will avoid taking the denormalized result set and producing a similarly denormalized XML document.  If you only define a loop element in this example, then there would be as many ITEMS as there are SUPPLIERS rather than the nested structure we desire.

See the following snippet for XML produced without groups.


 
   
     
   
 
 
   
     
   
 
 
   
     
   
 

This screenshot shows the correct configuration of tAdvancedFileOutputXML.  Note the Loop element definition on SUPPLIER and the group annotations on the ITEM_EAN and ITEM elements.

tAdvancedFileOutputXML Config
Result

The result is a nested document with 3 toplevel ITEM elements.




   
     
       
     

     
       
     

     
       
     

   

   
     
       
     

   

   
     
       
       
     

   
 

XML Note

The XML document is built with a particular transmission in mind.  In this instance, the document assumes that a list of ITEMS has (through EAN) a list of SUPPLIERS.  The condition could be inverted with additional relationships.  Suppose that, in this case, this is a preferred vendor list where the document receiver is looking for SUPPLIERS that provide items.  In a different context or with different relationships, this could be inverted.  Take a catalog of ITEMS offered by SUPPLIERS.

Another Post

For another example of using tAdvancedFileOutputXML, visit Nested XML From an RDBMS with Talend Open Studio 

No comments: