Tuesday, June 18, 2013

The Talend Open Studio Date Dimension Job


Use Java 1.6's Calendar class to build a rich date dimension using Talend Open Studio.

Java 1.6 has a new method getDisplayName() that will return text names and abbreviations for the different parts of a date. With some basic looping, this can produce a useful denormalized structure for a target like a table or spreadsheet when used in Talend Open Studio.

Here is a Job showing a set of components that routes a list of date records to a tLogRow.

Talend Open Studio: A Date Dimension Job
The Job contains 3 loops: YEAR_LOOP, MONTH_LOOP, and DAY_LOOP.  YEAR_LOOP loops over a numeric range, in this case 2010 to 2013.  MONTH_LOOP also loops over a numeric range, 0-11, for each month.  DAY_LOOP loops over the number of the days in the month, but to determine this, it needs some assistance.

Create Calendar is a tJava component that is called for each iteration of MONTH_LOOP, which in turn is called for each iteration of YEAR_LOOP.  For example, for the year '2011', Create Calendar will be called 12 times.  Create Calendar creates a Calendar object from the current year, current month, and the first day of the month (1).  Create Calendar then retrieves a special value computed by standard Java called actualMaximum.  This is the number of days in a particular month (28, 30, 31).  Create calendar then sets a global variable called MAXDAYS.

MAXDAYS is used as the upper bound of DAY_LOOP, which starts with 1, the first day of the month.  DAY_LOOP drives a tIterateToFlow which defines a 3-field schema: year, month, day.  The schema is filled with values taken from the CURRENT_VALUE of each of the three loops.  (Because MONTH_LOOP is zero-based, a 1 is added to it.)

the tFlowToIterate is directed to a tJavaRow component which creates another Calendar object based on the current year, month, and day.  "Lookup Cal Fields" routes the year, month, and day to the output AND adds several new fields.  These fields are month name, month abbreviation, day name, day abbreviation, and day in year ("Julian" date).

Here is the configuration for the components:

YEAR_LOOP (tLoop)

Loop Type: For
From: 2010
To: 2013
Step: 1
Values are increasing: checked

MONTH_LOOP (tLoop)

Loop Type: For
From: 0
To: 11
Step: 1
Values are increasing: checked

Create Calendar (tJava)
Import:

import java.util.Calendar;
import java.util.GregorianCalendar;


Code:

   Calendar cal = new GregorianCalendar(((Integer)globalMap.get("tLoop_1_CURRENT_VALUE")),     ((Integer)globalMap.get("tLoop_2_CURRENT_VALUE")), 1);
   globalMap.put("MAXDAYS", new Integer(cal.getActualMaximum(Calendar.DAY_OF_MONTH)));


DAY_LOOP (tLoop)

Loop Type: For
From: 1
Step: 1
Values are increasing: checked
To:

     ((Integer)globalMap.get("MAXDAYS")).intValue()



tIterateToFlow

Mapping (all Integer type):
Column / Value
year / ((Integer)globalMap.get("tLoop_1_CURRENT_VALUE"))
month / ((Integer)globalMap.get("tLoop_2_CURRENT_VALUE"))+1
day / ((Integer)globalMap.get("tLoop_3_CURRENT_VALUE"))

Here is the schema used in tIterateToFlow

Schema Used in tIterateToFlow

Lookup Cal Fields (tJavaRow)

Import:


import java.util.Locale;


Code:

output_row.year = input_row.year;
output_row.month = input_row.month;
output_row.day = input_row.day;

Calendar cal = new GregorianCalendar(input_row.year, input_row.month-1, input_row.day);

output_row.monthName = cal.getDisplayName(Calendar.MONTH, Calendar.LONG, Locale.getDefault());
output_row.monthAbbrv = cal.getDisplayName(Calendar.MONTH, Calendar.SHORT, Locale.getDefault());

output_row.dayName = cal.getDisplayName(Calendar.DAY_OF_WEEK, Calendar.LONG, Locale.getDefault());
output_row.dayAbbrv = cal.getDisplayName(Calendar.DAY_OF_WEEK, Calendar.SHORT, Locale.getDefault());

output_row.dayOfYear = new Integer(cal.get(Calendar.DAY_OF_YEAR));

Here is the schema used in the tJavaRow component (Lookup Calendar).  Note that several fields are added that are computed using Calendar off of the list indexes.

Schema Used in tJavaRow
With a few components, Talend Open Studio can create a job for loading a dimension.  In addition to the typical year/month/day values, there can be descriptive additions to the date dimension like month name or abbreviation.  Even if your date dimension needs to be supplemented with business-specific date like holidays, this script can serve as the starting point for an Excel spreadsheet.

No comments: