Tuesday, June 11, 2013

XML Output from Multiple Data Sources with Talend Open Studio


For XML documents with a strong hierarchical structure, use tAdvancedFileOutputXML in Talend Open Studio to map each source field to a target element or attribute.  If the XML document is less cohesive -- there are several data sets related only by a shared parent -- use the tFileOutputMSXML component.

When you're forming an XML document using Talend Open Studio and the XML has multiple loops, use the tFileOutputMSXML component.  tFileOutputMSXML lets you map several input data flows, each to its own copy of the root element; this results in multiple loops, one per data flow.  This is different from tAdvancedFileOutputXML, which relies on a single input to define a single loop.

Target Schema

Consider this graphical representation of an XSD, 'cruise-ports.xsd'.  This blog post walks through a Talend Open Studio job that outputs an XML document that adheres to the target schema.

cruise-ports.xsd
A top-level element 'cruise-ports' contains one or more cruise-port elements.  Within a cruise-port, there are two subelements, cruise-line and snack-shop, that may have different cardinalities.  For example, a cruise-port may have 4 cruise-lines, but only 2 snack-shops.

The XSD is here.

Basic Job Structure

The basic job structure for working with tFileOutputMSXML is to connect each input source to the tFileOutputMSXML component.  Unlike tAdvancedFileOutputXML, tFileOutputMSXML can take more than one main flow.

MSXML Output Job
The input sources are two text files.  'baltimore-cruise-lines.txt' is a 5-line text file whose first line contains the headers.  'baltimore-snack-shops.txt' is a 3-line text file whose first line contains the headers.

# baltimore-cruise-lines.txt
Name;Destinations
Carnival;Bahamas,Mexico,Bermuda
Holland America;Mexico,Puerto Rico
Royal Caribbean;Mexico,Trinidad,Togabo
Norwegian;Norway,Sweden,Finland,Russia,Netherlands


# baltimore-snack-shops.txt
Name;Hours of Operations
Joe's Coffee Stand;S,M,T,W,Th,F,S 6am-12pm
McDonald's;S,M,T,W,Th,F,S 6am-9pm

tFileOutputMSXML Config

The following procedure configures the tFileOutputMSXML component:

  1. Rename both copies of the default top-level element 'rootTag' to 'cruise-ports'
  2. On each copy, right-click and import an XML tree based on cruise-ports.xsd 
  3. For the copy associated with 'row1', remove the snack-shop elements
  4. For the copy associated with 'row2', remove the cruise-line elements
  5. Map the fields for row1 (Name, Destinations)
  6. Map the fields for row2 (Name, Hours of Operations)
The following screenshot shows the completed configuration.

tFileOutputMSXML Config
Result

The result of the run is the following XML

Resulting XML
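To make the two-loop shape concrete outside of Talend, here is a minimal Python sketch that builds a document of roughly the same form from the two input files shown earlier.  It is not what tFileOutputMSXML runs internally, and it may differ from the job's actual output.  The child element names 'name', 'destinations' and 'hours' are assumptions for illustration; only 'cruise-ports', 'cruise-port', 'cruise-line' and 'snack-shop' come from the schema description above.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Contents of the two input files, copied from the post.
CRUISE_LINES = """Name;Destinations
Carnival;Bahamas,Mexico,Bermuda
Holland America;Mexico,Puerto Rico
Royal Caribbean;Mexico,Trinidad,Togabo
Norwegian;Norway,Sweden,Finland,Russia,Netherlands
"""

SNACK_SHOPS = """Name;Hours of Operations
Joe's Coffee Stand;S,M,T,W,Th,F,S 6am-12pm
McDonald's;S,M,T,W,Th,F,S 6am-9pm
"""


def read_rows(text):
    """Parse semicolon-delimited text with a header line into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter=";"))


def build_cruise_ports(cruise_lines, snack_shops):
    """Build one cruise-port holding two independent loops of children."""
    root = ET.Element("cruise-ports")
    port = ET.SubElement(root, "cruise-port")
    # Loop 1: one cruise-line element per row of the first flow (row1).
    for row in cruise_lines:
        line = ET.SubElement(port, "cruise-line")
        ET.SubElement(line, "name").text = row["Name"]  # assumed child name
        ET.SubElement(line, "destinations").text = row["Destinations"]  # assumed
    # Loop 2: one snack-shop element per row of the second flow (row2).
    for row in snack_shops:
        shop = ET.SubElement(port, "snack-shop")
        ET.SubElement(shop, "name").text = row["Name"]  # assumed child name
        ET.SubElement(shop, "hours").text = row["Hours of Operations"]  # assumed
    return root


root = build_cruise_ports(read_rows(CRUISE_LINES), read_rows(SNACK_SHOPS))
print(ET.tostring(root, encoding="unicode"))
```

The two for-loops play the role of the two root-element copies in the component: each flow fills only its own kind of child, so the cardinalities (4 cruise-lines, 2 snack-shops) stay independent.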
Namespaces Warning

I received a number of errors when working with namespaces, and the Talend issue navigator has 29 unresolved DI issues (11-SEP-25) regarding namespaces and tFileOutputMSXML.  If namespaces are important for your particular requirement -- and namespaces are crucial to any composable XML modeling -- this example won't work for you without some type of post-processing that inserts a namespace prefix and top-level attribute.

You can slip in a default namespace using the 'add namespace' feature if all the elements are under the same namespace.
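One possible shape for that post-processing step is sketched below in Python, run on the file after the Talend job finishes.  This is only an illustration under assumptions: the namespace URI and the 'cp' prefix are hypothetical placeholders, and the sketch puts every element into a single namespace.

```python
import xml.etree.ElementTree as ET

NS_URI = "http://example.com/cruise-ports"  # hypothetical namespace URI
NS_PREFIX = "cp"                            # hypothetical prefix


def add_namespace(xml_text, uri=NS_URI, prefix=NS_PREFIX):
    """Return xml_text with every un-namespaced element moved into `uri`."""
    # Ask ElementTree to serialize this URI with our prefix instead of ns0.
    ET.register_namespace(prefix, uri)
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        # Qualified tags look like "{uri}local"; leave those untouched.
        if not elem.tag.startswith("{"):
            elem.tag = f"{{{uri}}}{elem.tag}"
    return ET.tostring(root, encoding="unicode")


sample = "<cruise-ports><cruise-port><cruise-line/></cruise-port></cruise-ports>"
namespaced = add_namespace(sample)
print(namespaced)
```

The serializer writes the xmlns declaration once, on the top-level element, and prefixes each element below it.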

Multiple Loops (Thanks "Rock")

If your XML document contains multiple looping elements, you can use several tAdvancedFileOutputXML components to build up the output in sections.  For each input component, create a tAdvancedFileOutputXML starting with the topmost element.  Each child element's tAdvancedFileOutputXML will use the "Append the source xml file" option.

These three data files are joined under the 'dept_no' identifier.  In this data model, a Department (depts.txt) contains Employees (emps.txt) and Printers (printers.txt).  There is no correlation between Employees and Printers, except for their parent Department.

depts.txt
------------------------------
dept_no,dept_name
100,Accounting
101,IT

emps.txt
------------------------------
dept_no,emp_no,emp_first_name
100,2000,joe
100,2001,carl
101,2002,steve

printers.txt
------------------------------
dept_no,printer_name
100,hp-acct-bw
100,hp-acct-color
101,hp-it-bw
101,hp-it-color
101,epson-plotter



In processing terms, there will be 3 loops: a loop building Employees, a loop building Printers, and a loop building the containing Departments.  These loops will be implemented using three distinct tAdvancedFileOutputXML components.

Job With Multiple Loops

The expected output of the job is a top-level set of Departments containing the Department's related Employees and Printers.

All tAdvancedFileOutputXML components in this job write to the same XML file.  tAdvancedFileOutputXML_3 and _5 have the "Append the source xml file" option set.

XML Component for Departments
XML Component for Employees
XML Component for Printers
In the input, each data file contains a dept_no and that field is mapped to /depts/dept/@dept_no in each tAdvancedFileOutputXML.  This associates children (Employees and Printers) with the parent Department.
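The same three-loop idea can be sketched in plain Python to show how the shared dept_no key attaches children to the right parent.  This is an illustration, not the job itself.  The path depts/dept/@dept_no comes from the mapping described above; the wrapper and child element names ('dept_name', 'emps', 'emp', 'printers', 'printer') are assumptions.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Contents of the three data files, copied from the post.
DEPTS = """dept_no,dept_name
100,Accounting
101,IT
"""

EMPS = """dept_no,emp_no,emp_first_name
100,2000,joe
100,2001,carl
101,2002,steve
"""

PRINTERS = """dept_no,printer_name
100,hp-acct-bw
100,hp-acct-color
101,hp-it-bw
101,hp-it-color
101,epson-plotter
"""


def rows(text):
    """Parse comma-delimited text with a header line into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def build_depts(depts, emps, printers):
    root = ET.Element("depts")
    emp_lists, printer_lists = {}, {}
    # Loop 1: create each dept, keyed by its dept_no attribute.
    for d in depts:
        dept = ET.SubElement(root, "dept", dept_no=d["dept_no"])
        ET.SubElement(dept, "dept_name").text = d["dept_name"]  # assumed name
        emp_lists[d["dept_no"]] = ET.SubElement(dept, "emps")  # assumed wrapper
        printer_lists[d["dept_no"]] = ET.SubElement(dept, "printers")  # assumed
    # Loop 2: append each employee under the dept with the same dept_no.
    for e in emps:
        emp = ET.SubElement(emp_lists[e["dept_no"]], "emp", emp_no=e["emp_no"])
        emp.text = e["emp_first_name"]
    # Loop 3: append each printer under the dept with the same dept_no.
    for p in printers:
        ET.SubElement(printer_lists[p["dept_no"]], "printer").text = p["printer_name"]
    return root


root = build_depts(rows(DEPTS), rows(EMPS), rows(PRINTERS))
print(ET.tostring(root, encoding="unicode"))
```

Loops 2 and 3 never look at each other's data; the only link between an employee and a printer is the dept they were appended to, which mirrors the "Append the source xml file" passes in the job.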

Special Field Processing on Single Data Source

Another application of this technique is when normalization is needed on more than one field.  Take the following input file as an example.  The input file has two multi-valued attribute columns: CITY and COLOR.

NAME;CITY;COLOR
TEST1;PARIS,LONDON;RED,GREEN,BLUE
TEST2;PARIS,BOSTON;YELLOW,GREEN,BLUE


The result of processing this file will be an XML document containing repeating groups of CITY and COLOR values.  The key to this processing is to define 2 loops using 2 tFileInputDelimited components on the same input.  One loop expands CITY, the other, COLOR.  A component like tReplicate won't work in this case because it doesn't render more than one loop.

2 Input Paths Using tNormalize
tAdvancedFileOutputXML_3 uses the "Append the source xml file" option to continue processing from _1.

This is the mapping for tAdvancedFileOutputXML_1.

XML Output Component with CITY Mapped
Note that COLOR is not mapped.  A COLORS element is added to hold the place of the COLOR sub-element created in the _3 component.

Here is the mapping for the tAdvancedFileOutputXML_3 component.  Note that CITY is not mapped.

XML Component with COLOR Mapped
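For comparison, the two normalization loops can be written out in a few lines of Python.  This is only a sketch of the intended result, not of how the Talend components work.  COLORS is the placeholder element mentioned above; 'ROWS', 'ROW', the NAME attribute and 'CITIES' are assumed names.

```python
import xml.etree.ElementTree as ET

# Contents of the input file, copied from the post.
INPUT = """NAME;CITY;COLOR
TEST1;PARIS,LONDON;RED,GREEN,BLUE
TEST2;PARIS,BOSTON;YELLOW,GREEN,BLUE
"""


def build_rows(text):
    lines = text.strip().splitlines()
    header = lines[0].split(";")
    root = ET.Element("ROWS")  # assumed root element name
    for line in lines[1:]:
        rec = dict(zip(header, line.split(";")))
        row = ET.SubElement(root, "ROW", NAME=rec["NAME"])  # assumed names
        # Loop 1: expand the comma-separated CITY column.
        cities = ET.SubElement(row, "CITIES")  # assumed wrapper
        for city in rec["CITY"].split(","):
            ET.SubElement(cities, "CITY").text = city
        # Loop 2: expand the comma-separated COLOR column.
        colors = ET.SubElement(row, "COLORS")
        for color in rec["COLOR"].split(","):
            ET.SubElement(colors, "COLOR").text = color
    return root


root = build_rows(INPUT)
print(ET.tostring(root, encoding="unicode"))
```

Each column is split on its own, so a row with 2 cities and 3 colors yields 2 CITY elements and 3 COLOR elements rather than a cross product.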

For an XML document based on a single input, use tAdvancedFileOutputXML.  tAdvancedFileOutputXML also supports grouping.  If you need more than one loop -- say there are lists of unrelated child elements -- use more than one tAdvancedFileOutputXML component.  For disjoint data sets, try tFileOutputMSXML.  If namespace support is required, you will need additional processing or another technique to add namespaces to your document.
