The tMap component in Talend Open Studio (TOS) is a productive and efficient way to combine a main data flow with lookups. However, the tMap is limited to join conditions based on equality. For a more complex expression -- such as comparing a range of dates -- adjust the data so that the range turns in to a something join-able via a tMap: convert a single record with range 1-3 into three records 1,2,3.
If adjusting the data isn't possible, then push the complex join to the input component by expanding the SQL statement of the input component.
Basic tMap Lookup
tMap joins lookups to the main data flow using equality. This "Load Fact Enroll" job joins a main MySQL data flow with two lookup tables.
Here is the tMap configuration for Load Fact Enroll. row1's SYSTEM_NM is joined to row2's SYSTEM_NM, bringing in the SYSTEM_ID column from row2. row1's YEAR_NB is joined to row3's YEAR_NB bringing in the YEAR_ID column. (The YEAR_ID lookup is for demonstration purposes and not very useful.)
For best performance, the lookup should be unique and cached ("Load once"). There is an option to select a first element in case uniqueness isn't available.
The result of the map is a SQL join based on the following conditions (an equijoin is specified).
row1.SYSTEM_NM = row2.SYSTEM_NM AND
row1.YEAR_NB = row2.YEAR_NB
Date Range
These two tables are of different grain. TransactionTable has records recorded at the day level (more realistically, they would be timestamped) and TransactionWeekTable has records recorded at the week level.
A single TransactionWeekTable record covers a range of dates. The table contains a record for each week of the year (as determined by the business user). Although you can get the WEEK_OF_YEAR from a Talend function, this post assumes that WEEK_OF_YEAR doesn't match the business user's requirements. (Is January 2, 2011 in Week 1 or Week 2?)
Rather than using the equijoin in the previous section, to get this to work the tMap would have to generate something like
row1.transaction_date >= row1.[Start Date] AND
row1.transaction_Date <= row2.[End Date]
Which isn't available using the simple drag-and-drop of the tMap interface.
Fix the Data
Once way to handle the complex join is to fix the data. Kimball recommends building a complete, business-driven date to handle each day. Each day can be allocated to a quarter or week, labeled as a holiday, or assigned a pay period. In TransactionWeekTable, the fix would turn each range-based record into a set of records, one for each day. For example,
Start Date End Date Week
--------------------------------
1/1/2011 1/8/2011 1
Would be replaced with a TransactionDayTable (or simply "CalendarTable")
Day Day of Year Week Holiday Pay Day Weekend
--------------------------------------------------------
1/1/2011 1 1 Y N Y
1/2/2011 2 1 N N Y
1/3/2011 3 1 N N N
etc.
Then, tMap's standard lookup mechanism can be used because although the range join isn't supported, a simple date function to retrieve a day is available in the Routines (TalendDate.getPartOfDate). Map the main flow column "transaction_date" to the new CalendarTable column "Day of Year" and edit the expression to use TalendDate.getPartOfDate("DAY_OF_YEAR) on the main flow column.
Here is an example spreadsheet from Kimball's web site to an example CalendarTable.
Move the Join
If you would like to fix this in the TOS code, then move the join to an input component like tAccessInput. This job removes the lookup table from the canvas; it's now embedded in the tAccessInput query.
The Guess schema function should be used to regenerate the schema based on the added columns. The input source is no longer the pure TransactionTable but a hybrid of TransactionTable and TransactionWeekTable.
Here is the simple tMap supporting the job.
This example is on calendars, but this technique can be applied in other contexts. Product codes, account numbers, and certain NOT and EXISTS cases come to mind.
As with most integration cases, it's best to solve a data problem or deficiency in the data. In this example, breaking a week-oriented table out to a day-oriented table not only simplified the TOS design but also created a conduit for additional business requirements (quarter, pay period, etc.). If you're not able to fix this problem in the data, then overcome Talend's equality-based tMap by adding joins in the SQL supporting input components.
If adjusting the data isn't possible, then push the complex join to the input component by expanding the SQL statement of the input component.
Basic tMap Lookup
tMap joins lookups to the main data flow using equality. This "Load Fact Enroll" job joins a main MySQL data flow with two lookup tables.
A Job with Two Lookups |
Input Joined to Two Lookups |
The result of the map is a SQL join based on the following conditions (an equijoin is specified).
row1.SYSTEM_NM = row2.SYSTEM_NM AND
row1.YEAR_NB = row2.YEAR_NB
Date Range
These two tables are of different grain. TransactionTable has records recorded at the day level (more realistically, they would be timestamped) and TransactionWeekTable has records recorded at the week level.
Daily Transactions |
Weekly Reference Data |
row1.transaction_date >= row1.[Start Date] AND
row1.transaction_Date <= row2.[End Date]
Which isn't available using the simple drag-and-drop of the tMap interface.
Fix the Data
Once way to handle the complex join is to fix the data. Kimball recommends building a complete, business-driven date to handle each day. Each day can be allocated to a quarter or week, labeled as a holiday, or assigned a pay period. In TransactionWeekTable, the fix would turn each range-based record into a set of records, one for each day. For example,
Start Date End Date Week
--------------------------------
1/1/2011 1/8/2011 1
Would be replaced with a TransactionDayTable (or simply "CalendarTable")
Day Day of Year Week Holiday Pay Day Weekend
--------------------------------------------------------
1/1/2011 1 1 Y N Y
1/2/2011 2 1 N N Y
1/3/2011 3 1 N N N
etc.
Then, tMap's standard lookup mechanism can be used because although the range join isn't supported, a simple date function to retrieve a day is available in the Routines (TalendDate.getPartOfDate). Map the main flow column "transaction_date" to the new CalendarTable column "Day of Year" and edit the expression to use TalendDate.getPartOfDate("DAY_OF_YEAR) on the main flow column.
Here is an example spreadsheet from Kimball's web site to an example CalendarTable.
Move the Join
If you would like to fix this in the TOS code, then move the join to an input component like tAccessInput. This job removes the lookup table from the canvas; it's now embedded in the tAccessInput query.
Lookup Pushed to Input Component |
Here is the simple tMap supporting the job.
tMap with Extended Schema |
As with most integration cases, it's best to solve a data problem or deficiency in the data. In this example, breaking a week-oriented table out to a day-oriented table not only simplified the TOS design but also created a conduit for additional business requirements (quarter, pay period, etc.). If you're not able to fix this problem in the data, then overcome Talend's equality-based tMap by adding joins in the SQL supporting input components.
No comments:
Post a Comment