Thursday, June 13, 2013

Procedural Update in Talend Open Studio


For optimal performance, have the database do most of the work when updating.  Sometimes, this won't be feasible and you'll need to create loops and parameterize queries in a Talend Open Studio job.

SQL Update or Delete

To execute a single SQL UPDATE or DELETE statement, use the t*Row components in the Databases family: tOracleRow, tPostgresqlRow, tMysqlRow.  In the Query box in "Basic settings", enter a SQL string joined with variables taken from the globalMap of the job.

The t*Row components can be called as standalone subjobs, in an iteration, or as the target of a flow.

Procedural Coding

When it isn't practical to render a data transformation in SQL, use a procedure to more cleanly express the algorithm.  In Talend Open Studio, this is implemented using components like tLoop and tFlowToIterate.  These component break a data flow up -- say from an Access datasource -- into distinct iterations.  Each iteration can drive additional processing like keeping a counter of the records processed in a current iteration.

In this post, a table "UpdateTestSourceTable" is loaded with 5 records.  Each record contains a RECORD_ID which is a business record key that is not unique.  An auto-increment DB_RECORD_ID is used for uniqueness.  SEC_ROW is a column that defines ordering within a RECORD_ID.  Initially, this is empty.  The job in this post will set the value as in the screenshot.

Data Used in Example

This job implements the following procedural algorithm.  RECORD_ID is a business record key with meaning, but is not unique in the table.  DB_RECORD_ID is a surrogate key that guarantees uniqueness.

Get a list of distinct RECORD_IDS from a table
For each RECORD_ID 
    Initialize a COUNTER variable 
    Find a list of records matching RECORD_IDS (Get "DB_RECORD_ID")
    Map the fields
    Increment the COUNTER variable
    Update the table using COUNTER constrained by DB_RECORD_ID

The job uses a tAccessInput "Get Rec Ids" to start the processing.  tAccessInput goes into a tFlowToIterate.  For each iteration, the next five components are called.
Job Loading Records by Iteration

 Lookup Rec is a second tAccessInput that runs a query returning the distinct DB_RECORD_IDs.  It uses row1.record_id as a parameter.  Here is the SQL with the parameter.

SQL Parameterized with a tFlowToIterate's Variable
 The Update component is a tAccessOutput which has "Action on data" set to Update.  The schema is a subset of UpdateTestSourceTable: only db_record_id (key) and seq_row.

For learning about how to set up a counter in a Talend job, read Running Count in Talend Open Studio. "Set Sec Num" is set to 1 at the stat of each RECORD_ID iteration; this sets the variable "counter".  The code behind the tJavaRow increments the counter and maps each input field to the output.

output_row.db_record_id = input_row.db_record_id;
output_row.sec_row = input_row.sec_row;

int counter = ((Integer)globalMap.get("counter"));
globalMap.put("counter", ++counter);


For a database like Oracle RAC, nothing beats a set-based operation like MERGE.  However, it may be too difficult if not impossible to render a particular requirement in SQL.  In these cases, use the procedural facilities (tLoop, etc.) in Talend Open Studio to better interpret the algorithm.  Watch for performance!  If you do too much work in a single iteration, your job might not scale.

Additional Screen Shots

The following screen shots are for (in order) the configuration of "Get Rec Ids", "Lookup Rec", and "Update".

tAccessInput "Get Rec Ids" Config

tAccessInput "Lookup Rec" Config
 See the above screenshot "SQL Parameterized" for the embedded SQL statement in "Lookup Rec".

tAccessOutput "Update" Config
  The following is the tMap used in the job.

tMap Config
Excel Input (not Database) Walkthrough

In most cases, it's best to get the data into an RDBMS as quickly as possible and leave the flat files -- including Excel files -- saved for historical records.  The RDBMS gives the ultimate in flexibility and performance.  However, it is possible to do complex processing using Talend components like tAggregateRow or tFilterRow in place of SQL.

For a walkthrough of a job that uses only Excel (not a relational database) for an input source, watch the following video. 

http://www.youtube.com/watch?feature=player_embedded&v=8_XtGcQo7Xw#t=0s

Excel Output

This screenshow shows a job using a tExcelFileOutput.  It was created using 4.2.1.  The older format (excel2007 unchecked) was also tested.  The Component View is showing the configuration for the tExcelFileOutput component.

Job Using a tExcelFileOutput

No comments: