Thursday, June 13, 2013

Dynamic Schemas in Talend Open Studio


When working with input data from a source like a spreadsheet that is subject to editing by users, you can't always count on a strict format.  For instance, the columns may be reordered.  But if the spreadsheet contains column headers that are still accurate, you can map it to a schema using Talend Open Studio.

In Talend Open Studio (TOS), you define schemas to describe the purpose and data types of a source or target.  With flat files or spreadsheets, TOS is strict about associating a column position with a particular field definition.  If the first field is designated as "First Name" in the File Excel wizard, the input is expected to contain "First Name" in its first field, regardless of what the column header says.


To build a schema on-the-fly, make two passes over the input file.  First, record the header information in a data structure for use in later processing.  Then, for each input record, consult that data structure to map the fields.

The Talend Exchange has a good example of this in "Example job for the dynamic schema routine" at http://www.talendforge.org/exchange/tos/extension_view.php?eid=177 (rbaldwin).  The job uses Java reflection wrapped up in a Talend Routine.

The following job is an alternative example: the algorithm is identical, but the Java code and the set of Talend components differ.

A Talend Job Reading a Header Prior to Processing

Algorithm

The file is processed in three stages:

  1. Initialize a data structure
  2. Read in and record the header
  3. For each input record, map the fields to the correct schema using the header data structure

The header data structure is a java.util.Map.  It is created and associated with the globalMap in a tSetGlobalVar.  There are also Java imports for java.util.* and java.lang.reflect.* in the Advanced settings.

In the "Basic settings" of the tSetGlobalVar, add an entry with the key "fieldMap" and the following object as its value.

  new HashMap()

"fieldMap" should be quoted.  Don't quote the new command.

Record the Header

The header is read in using a tFileInputDelimited which is configured to stop after the first record (Limit=1).  The tFileInputDelimited is based on a loose schema where every field is a String and there is one field for each possible input column.  The names of the fields are left generic (Column0, Column1, and so on) so that they aren't mislabeled if a different column order is used.

A tJavaRow calls the put() method on the header Map.  The key is the true field name, taken from the first line of the input file; the value is the corresponding generic column name from the loose schema.  For example, if the first column header is "Field_B" and the first field of the loose tFileInputDelimited schema is "Column0", then the pair Field_B/Column0 is recorded.

Here is the Java reflection code used in the tJavaRow_3 component.  For each field in the input row, it records the header text from the first line (only one line is read) against the loose schema's column name.

// Retrieve the map created in the tSetGlobalVar.
Map fieldMap = (Map)globalMap.get("fieldMap");

// Each public field of the row object is one column of the loose schema.
// Its value is the header text read from the first line, so record
// header text -> generic column name (e.g. "Field_B" -> "Column0").
for( Field f : row1.getClass().getFields() ) {
   fieldMap.put((String)f.get(row1), f.getName());
}
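
As an illustration only (the row struct and header values below are made up, not taken from the actual job), here is a standalone sketch of what the header pass produces:

import java.lang.reflect.Field;
import java.util.HashMap;
import java.util.Map;

public class HeaderPassSketch {

    // Stand-in for the loose-schema row after the header line is read:
    // each generic column holds the header text found in that position.
    public static class Row1Struct {
        public String Column0 = "Field_B";
        public String Column1 = "Field_A";
        public String Column2 = "Field_C";
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> fieldMap = new HashMap<String, String>();
        Row1Struct row1 = new Row1Struct();

        // Same loop as the tJavaRow above.
        for (Field f : row1.getClass().getFields()) {
            fieldMap.put((String) f.get(row1), f.getName());
        }

        // Prints something like {Field_B=Column0, Field_A=Column1, Field_C=Column2}
        // (entry order may vary).
        System.out.println(fieldMap);
    }
}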


Input Processing

The file is processed in a second pass.  The same tFileInputDelimited setup is used (same file, same loose schema).  However, the second pass uses Header=1 instead of Limit=1, which skips the header line and reads the rest of the file.

This job uses a tMap to establish the target schema.  The tMap controls the eventual ordering of the columns.
Map Using Dynamic Column Values

Vars are used to reduce the amount of Java code in the target mapping.  Each Var looks up the fieldMap using the target field name as its key, and the looked-up values are used to populate the OUT schema.
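
The Var expressions themselves aren't listed here, but one way to express the lookup is a small user routine like the sketch below; the class name, method signature, and the row2/Field_B names are placeholders for illustration, not part of the original job.

package routines;

import java.util.Map;

public class DynamicSchemaLookup {

    /**
     * Returns the value of the loose-schema column that the header pass
     * mapped to the given target field name, or null if there is no mapping.
     */
    public static String fetch(Object row, String targetField, Map<String, Object> globalMap) {
        Map<?, ?> fieldMap = (Map<?, ?>) globalMap.get("fieldMap");
        Object columnName = fieldMap.get(targetField);
        if (columnName == null) {
            return null;
        }
        try {
            // Reflectively read the generic column (e.g. "Column0") from the input row.
            return (String) row.getClass().getField(columnName.toString()).get(row);
        } catch (NoSuchFieldException e) {
            throw new RuntimeException("No column " + columnName + " on the input row", e);
        } catch (IllegalAccessException e) {
            throw new RuntimeException("Cannot read column " + columnName, e);
        }
    }
}

A Var for the Field_B output could then be an expression such as DynamicSchemaLookup.fetch(row2, "Field_B", globalMap), with row2 standing in for whatever the tMap's input row is named.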

Video

This is a video walkthrough of the dynamic schema job.

http://www.youtube.com/embed/bLq8ip6nH5g?feature=player_embedded

From this example, and rbaldwin's example on Talend Exchange, you can see that it's feasible to accept input files whose columns have been rearranged.  However, you will need to work with Java (or Perl) to extend the standard File Delimited or File Excel wizards.