Tuesday, June 18, 2013

Combining Excel Spreadsheets with tUnite in Talend Open Studio


When you need to combine several Excel workbooks or sheets inside a workbook, use Talend Open Studio's tUnite component.

The tUnite component takes one or more inputs and outputs a data flow in the style of a SQL UNION..  If using Excel sources, and configured accordingly, the tUnite component will also output each sheet in the workbook in the data flow.

These Excel workbooks contain business contacts spread across two regions:BusinessContacts_West.xlsx and BusinessContacts_East.xlsx.  Additionally, there are sheets for each fiscal year in the workbook.

An Excel Workbook with Several Sheets
Configuring a component like tFileInputExcel to process each sheet will send all of a workbook's records in each sheet to the output data flow.  For example, the following component configuration is for a tFileInputExcel that references metadata from the repository.

Configuration for tFileInputExcel
This configuration specifies the file to process, BusinessContacts_West.xlsx,.  The configuration references previously-loaded metadata from the repository, marks all sheets for processing ("All sheets"), sets up a header ("Header 1"), and applies that header to each sheet ("Affect each sheet(header&footer)".  The Excel 2007 is also set.  (If you get an OLE error, be sure to verify the version.)

"Header 1" is important so that the headers repeated on each sheet aren't also repeated in the output data flow and mixed with the data.

The two workbooks are configured similarly.  Each workbook's main data flow is sent to a tUnite component which sends its output to a tLogRow (for demonstration purposes).
Sample tUnite Job
The tUnite configuration is straightforward.  It's using the same business contacts schema as the tFileInputExcels which is taken from the repository.
tUnite Configuration
And a run produces

Running the tUnite Example
tUnite provides a nice alternative to writing code that loops through workbooks and sheets.  You may need to write a loop if you need access to the iteration controls like the workbook name or a value encoded in the sheet name.  If this isn't the case, tUnite is preferred because configuration over code means less testing.

No comments: