Saturday, November 1, 2008

Creating a Multi Valued Attribute in Talend Open Studio

Creating a Multi Valued Attribute in Talend Open Studio

I think this blog post is going to be more relevant as NoSQL databases catch on where data is packed in columns.  The topic is how to create a multi-valued attribute -- a comma-separated list of values -- from a bunch of rows.
Input and Results

 A multi-valued attribute is a collection of values stored in a column.  In the RDBMS world, it's common to normalize this information, storing the collection in a join.  Such a table would look like this input text file.



A List of Services Codes Associated with a Hotel
Each row is identified by a business key, HotelId.  For each hotel, there can be one or more ServiceCodes.  For example, there are three services offered by HOTEL1: SWI, TEN, and DRY.

The desired output for the multi-valued attribute is to gather the service codes for a hotel, and output a single row for a hotel with all the services codes in one column.  The following results show the target output.  HOTEL1 started with 3 rows: HOTEL1, SWI; HOTEL1, TEN; HOTEL1, DRY.  The rows are converted to a single row with the service code list column "SWI,TEN,DRY".  The results demonstrate that the solution works across a range of inputs from a single value "DRY" for "HOTEL2" to the 4-record set of HOTEL5.

Hotels with Lists of Service Code Values
The solution I've come up with is pretty much pure Talend.  There is a little extra Java in the tMap, but there are no tJava components or Java Collections Framework classes used.

Job Design

The following job shows an input file being routed into a tMap.  The results are passed out to a tAggregateRow.  A tSortRow is used to improve the presentation.  Finally, a tLogRow prints the output, though any component could be used.

Job Producing Multi-Valued Attribute
Input File

This screenshot shows the configuration of the input text file.  Two columns -- HotelId and Service Code -- are passed along to the tMap.

Input File Configuration
tMap

Most of the functionality for the job is in the tMap.  The tMap is responsible for forming a comma-separated list of the service codes.  The result of the tMap is an intermediate work product that gathers the services codes into a running concatenation CurrServiceCodeList.  This variable is reset with each change in HotelId which is tracked with a state variable CurrHotelId.

The input must be sorted for this algorithm to work.

tMap Tracking Changes in Hotel Id Values
I've mapped several fields to 'out1', but only care about HotelId and CurrServiceCodeList.  I'm outputting everything from the tMap so that I can break the problem down.  The unwanted fields are filtered later.

tMap Variables

CurrHotelId is a variable and the most important part of the job.  It is referring to 'out1.HotelId' which is the output value.  However, that value carries over from the previous iteration (flows are compiled into iterations).  It's accessible in the computation of CurrServiceCodeList.

The screenshot does not show the variable expression for CurrServiceCodeList clearly, so it is printed below.

(!row1.HotelId.equals(Var.CurrHotelId))?row1.ServiceCode:Var.CurrServiceCodeList + "," + row1.ServiceCode  

If the input is different than CurrHotelId (including the first record), then CurrServiceCodeList is the one-element list row1.ServiceCode.  If the input HotelId is the same as CurrHotelId, then concatenation occurs.  Remember that the tMap is an intermediate product, so for those hotels with more than one record, there will still be more than one record output. However, the last of these records will have a CurrServiceCodeList that has accrued all of the values.

tAggregateRow


The tAggregateRow is used to step down the number of rows.  The tMap passes along intermediate values: HOTEL1: SWI, HOTEL1: SWI,TEN, HOTEL1: SWI,TEN,DRY.  But we only care about the record with all of the concatenated values.  The aggregation relies on a "max" computation on the strings to select one of the values over the others.  The MAX is taking the record with the longest ServiceCodeList, assuming that that's the one with all the values.

Stepping Down the Number of Rows with a tAggregate
Finally, a tSortRow is used because a side effect of the tAggregateRow is to scramble the order.  (Somewhere a HashMap or Hashtable is involved.)  A tLogRow outputs the results.

This is a Talend-centric solution.  While there is a complex Java expression in a tMap variable, the job itself is cordoned off from any complex coding.  This job kept track of the state of processing using a tMap variable.  An intermediate product was built and then adjusted later (tAggregateRow).  Something like this would work well with applications that store more in a column than a single data value, like a list of properties.

UPDATE:


This is a version that does away with the Java in the tMap Variables.  If you get a NullPointerException on the tDenormalizeSortedRow, make sure that a tSortRow precedes the component.

Multi Valued Attribute with tDenormalize