Tuesday, June 11, 2013

Migrating Legacy Data with Talend Open Studio's tSplitRow


In the life of a database, columns are added to tables to support new data elements.  When a pattern of similar columns develops, it may be worthwhile to move the repeating group to a separate table.  Talend Open Studio's tSplitRow component turns each column into its own row.

Contact information in databases changes with new technologies.  "Email" and "web site" were added as  standard contact fields in the 90's.  Now, contacts can have Twitter handles, Facebook user names, YouTube channels and multiple email addresses (Google Circles, etc.).

Software applications can't always keep up with the new technologies.  Accounting packages like Sage Peachtree will provide custom fields for handling this type of thing.  An email address can be stored in Peachtree Accounting 2012's contact information, but you'll need to use a custom field to store a Twitter handle.

Legacy Data

The following data set is inspired from such a scenario.  The screenshot is taken from Talend Open Studio's (TOS) tFileInputExcel wizard.  It shows three contact records with a mandatory Email address and three custom fields (Custom_Field1, Custom_Field2, CustomF_Field3) which may contain additional email addresses.
tFileInputExcel Wizard Showing Legacy Data
Migration

The target schema will move the repeating email columns into a single table.  Each email column (Email, Custom Field 1) will be realized in its own record in the target.  Empty email columns will not produce a target record.

The following screenshot shows a TOS job producing the desired output.  Three records are read-in; nine are output by the tSplitRow.  Each email field is saved in a target record (3 email fields x 3 input records = 9 output records).  A tFilterRow is added to remove records that have an empty email field.  Only the first contact record has its Custom Field 2 set, so two records are filtered.

TOS Job Showing Converted Output
tSplitRow Config

The tSplitRow component is configured by defining three mappings.  First_Name and Last_Name are repeated for each mapping.  Each of the three email columns (Email, Custom Field 1, Custom Field 2) differentiate the column mappings.

tSplitRow Column Mappings
Moving a set of columns from one table into its own table is a common data model change.  Often, systems like Peachtree Accounting rely on custom fields to record new data elements.  If this is the case, watch for data quality problems like missing values or incorrect column usage because there aren't stringent validations on these type of fields.

No comments: