Tuesday, June 18, 2013

Excel Workbook Processing with Talend Open Studio


To process Excel files in a loop that iterates over a set of files and the sheets within the files, use the tFileList and tFileInputExcel components along with a few global variables.

Processing input files often involves looking in a directory and iterating through a set of files that match a particular filename pattern (for example, *.xlsx for Excel 2007 files).  There may be additional processing required if each Excel file is a Workbook with more than one Sheet.

This example uses the tFileList and tFileInputExcel components to create a double-loop of processing.  The outer loop iterates over the files in a directory, defined as a global variable using a ContextGroup and Context.  The inner loop iterates over each Sheet in the Workbook.  As the files are processed, both the Excel filename and the Sheet name are used in the data loading.  That is, the filename and the sheet name contain encoded business values that aren't found in the rows and columns, but that needs to be loaded in the database.

In this tutorial, an Operating Region is embedded in the filename.  So, "exceltest_west.xlsx" results in "west" being used for that column.  Sales Region is embedded in the Sheet name.  A Sheet named "New England" would provide the values for that column.

Start off by creating a DATA_DIR variable in the context.  Create a ContextGroup matching the Job name.  Add a DATA_DIR variable.

Talend Open Studio - Create Context Group and Variable

Then, add a default value for the variable, DATA_DIR.  In practice, there might be Contexts created for each environment.
Talend Open Studio - Set Variable Value

Drag the ContextGroup onto the Job so that DATA_DIR is available.

Next, add the four components: tFileList, tFileInputExcel, tMap, and tFileOutputDelimited (append mode).

Talend Open Studio - Add Components

Configure the components.  Use the following as a guide for configuring the tFileList component.

Talend Open Studio - Configure tFileList


Configure the tFileInputExcel component.  The most important part of this configuration is to use both the DATA_DIR variable and the CURRENT_FILE_PATH as the File name/Stream.  CURRENT_FILE_PATH comes from the tFileList component and is prepended with "tFileList_1_".

A tMap component is used to produce a delimited text file.  All of the fields in the spreadsheet appear in the tMap and are dragged to the empty schema of the text file.  Two extra fields are added: OperatingRegion and Sales Region.



Talend Open Studio - Map Fields

Here is the expression used for OperatingRegion. It uses some Java String functions to break off the suffix (.xlsx) and the first part (exceltest_).

((String)globalMap.get("tFileList_1_CURRENT_FILE")).split("\\.xlsx")[0].split("_")[1] 


split() is a Java function and the above statement isn't particularly elegant.  This can be cleaned up using a library like Jakarta Commons Lang.  Look for StringUtils.substringBetween() to grab the middle of a String without the brittle indexing.  Take a look at this blog post for instructions on how to bring this functionality into Talend.

StringUtils.substringBetween( (String)globalMap.get("tFileList_1_CURRENT_FILE"), "exceltest_", ".xlsx")


Here is the expression used for SalesRegion.

(String)globalMap.get("tFileInputExcel_1_CURRENT_SHEET")


It's not uncommon for an Excel file to be contain special values encoded in the filename or the Sheet name.  A few global variables set by components during processing can capture these.  This post used _CURRENT_FILE, a variable from tFileInputList, and _CURRENT_SHEET, from tFileExcelInput.  The available globals are listed in the documentation, but to get the exact syntax, use auto complete.

No comments: