Thursday, June 13, 2013

Complex Joins Out of tMap in Talend Open Studio


The tMap component in Talend Open Studio (TOS) is a productive and efficient way to combine a main data flow with lookups.  However, the tMap is limited to join conditions based on equality.  For a more complex expression -- such as comparing a range of dates -- adjust the data so that the range turns in to a something join-able via a tMap: convert a single record with range 1-3 into three records 1,2,3.

If adjusting the data isn't possible, then push the complex join to the input component by expanding the SQL statement of the input component.

Basic tMap Lookup

tMap joins lookups to the main data flow using equality.  This "Load Fact Enroll" job joins a main MySQL data flow with two lookup tables.

A Job with Two Lookups
Here is the tMap configuration for Load Fact Enroll.  row1's SYSTEM_NM is joined to row2's SYSTEM_NM, bringing in the SYSTEM_ID column from row2.  row1's YEAR_NB is joined to row3's YEAR_NB bringing in the YEAR_ID column.  (The YEAR_ID lookup is for demonstration purposes and not very useful.)
Input Joined to Two Lookups
For best performance, the lookup should be unique and cached ("Load once").  There is an option to select a first element in case uniqueness isn't available.

The result of the map is a SQL join based on the following conditions (an equijoin is specified).

row1.SYSTEM_NM = row2.SYSTEM_NM AND
row1.YEAR_NB = row2.YEAR_NB


Date Range

These two tables are of different grain.  TransactionTable has records recorded at the day level (more realistically, they would be timestamped) and TransactionWeekTable has records recorded at the week level.

Daily Transactions
A single TransactionWeekTable record covers a range of dates. The table contains a record for each week of the year (as determined by the business user).  Although you can get the WEEK_OF_YEAR from a Talend function, this post assumes that WEEK_OF_YEAR doesn't match the business user's requirements.  (Is January 2, 2011 in Week 1 or Week 2?)

Weekly Reference Data
Rather than using the equijoin in the previous section, to get this to work the tMap would have to generate something like

row1.transaction_date >= row1.[Start Date] AND
row1.transaction_Date <= row2.[End Date]

Which isn't available using the simple drag-and-drop of the tMap interface.

Fix the Data

Once way to handle the complex join is to fix the data.  Kimball recommends building a complete, business-driven date to handle each day.  Each day can be allocated to a quarter or week, labeled as a holiday, or assigned a pay period.  In TransactionWeekTable, the fix would turn each range-based record into a set of records, one for each day.  For example,

Start Date     End Date    Week
-------------------------------- 
1/1/2011       1/8/2011    1

Would be replaced with a TransactionDayTable (or simply "CalendarTable")

Day      Day of Year Week   Holiday   Pay Day   Weekend
-------------------------------------------------------- 
1/1/2011 1           1      Y         N         Y
1/2/2011 2           1      N         N         Y
1/3/2011 3           1      N         N         N 
etc.

Then, tMap's standard lookup mechanism can be used because although the range join isn't supported, a simple date function to retrieve a day is available in the Routines (TalendDate.getPartOfDate).  Map the main flow column "transaction_date" to the new CalendarTable column "Day of Year" and edit the expression to use TalendDate.getPartOfDate("DAY_OF_YEAR) on the main flow column.

Here is an example spreadsheet from Kimball's web site to an example CalendarTable.

Move the Join

If you would like to fix this in the TOS code, then move the join to an input component like tAccessInput.  This job removes the lookup table from the canvas; it's now embedded in the tAccessInput query.

Lookup Pushed to Input Component
 The Guess schema function should be used to regenerate the schema based on the added columns.  The input source is no longer the pure TransactionTable but a hybrid of TransactionTable and TransactionWeekTable.

Here is the simple tMap supporting the job.

tMap with Extended Schema
This example is on calendars, but this technique can be applied in other contexts.  Product codes, account numbers, and certain NOT and EXISTS cases come to mind.
 
As with most integration cases, it's best to solve a data problem or deficiency in the data. In this example, breaking a week-oriented table out to a day-oriented table not only simplified the TOS design but also created a conduit for additional business requirements (quarter, pay period, etc.).  If you're not able to fix this problem in the data, then overcome Talend's equality-based tMap by adding joins in the SQL supporting input components.

No comments: