Tuesday, June 11, 2013

Dynamically Filtering Columns for XML with Talend Open Studio


To filter columns when producing an XML document with Talend Open Studio, use the tFileOutputMSXML component with the "Create empty element if needed" checkbox unset.

Talend Open Studio's tFileOutputMSXML component creates an XML document from an input schema.  This component uses a map dialog to link the input fields to XML elements and attributes.  On the Advanced settings, there is an option that prevents XML elements from being produced if there is no value.  This setting can be used to filter columns.

Remove Empty Elements from Null Input
Using an Excel input and an XML output, define a schema that will include all the columns.  For example, this Excel file has four columns: brand, type, price, and shelf.  Note that "shelf" does not have a value for "sunkist".
Excel Spreadsheet
Create a job that uses a tFileInputExcel component run into a tJavaRow which is run into a tFileOutputMSXML.  Base the tFileInputExcel component's schema on all available columns.  Synchronize all schemas using the "Sync columns" button on the tJavaRow.

Job with Context Variable for Allowed Columns
Create a context variable "OUTPUT_COLUMNS" that is a comma-separated list of column names used in the schema.  These names must match the column names used throughout the job.

Create the mappings for the XML document.  Define a root container tag (sodas) and loop tags (soda).  Map each available column as its own subelement.  Each subelement must retain its input source column name.

Mapping for tFileOutputMSXML
Finally, add in the Java code for the tJavaRow.  This code loops over the list of column names provided by the context variable.  Java Reflection is used to dynamically map the input and output fields.  This avoids having to hardcoded the names using "input_row.brand".  On Advanced settings, java.lang.reflect.Field must be imported.


Class c_in = input_row.getClass();
Class c_out = output_row.getClass();

if( context.OUTPUT_COLUMNS != null ) {

  String[] fieldNames =
     context.OUTPUT_COLUMNS.split(",");

  for( String fn : fieldNames ) {

    // input / output schemas must match

    Field f_in = c_in.getDeclaredField(fn);
    Field f_out = c_out.getDeclaredField(fn);

    if( f_in.get(input_row) == null ) {
      f_out.set(output_row, "");
    } else {
      f_out.set(output_row, f_in.get(input_row));
    }
  }
}



The check for null is handling the case where you want an empty element to show up in the XML.  "shelf" is not filled in for "sunkist", but it should show up if "shelf" is an included column.

With Java Reflection, you can build a filtering mechanism that will produce an XML document tailored to your particular requirement.  Build comprehensive source and target schemas and use a tJavaRow and a setting on the tFileOutputMSXML component to apply the filtering logic.

No comments: