For optimal performance, have the database do most of the work when updating. Sometimes, this won't be feasible and you'll need to create loops and parameterize queries in a Talend Open Studio job.
SQL Update or Delete
To execute a single SQL UPDATE or DELETE statement, use the t*Row components in the Databases family: tOracleRow, tPostgresqlRow, tMysqlRow. In the Query box in "Basic settings", enter a SQL string joined with variables taken from the globalMap of the job.
The t*Row components can be called as standalone subjobs, in an iteration, or as the target of a flow.
Procedural Coding
When it isn't practical to render a data transformation in SQL, use a procedure to more cleanly express the algorithm. In Talend Open Studio, this is implemented using components like tLoop and tFlowToIterate. These component break a data flow up -- say from an Access datasource -- into distinct iterations. Each iteration can drive additional processing like keeping a counter of the records processed in a current iteration.
In this post, a table "UpdateTestSourceTable" is loaded with 5 records. Each record contains a RECORD_ID which is a business record key that is not unique. An auto-increment DB_RECORD_ID is used for uniqueness. SEC_ROW is a column that defines ordering within a RECORD_ID. Initially, this is empty. The job in this post will set the value as in the screenshot.
This job implements the following procedural algorithm. RECORD_ID is a business record key with meaning, but is not unique in the table. DB_RECORD_ID is a surrogate key that guarantees uniqueness.
Get a list of distinct RECORD_IDS from a table
For each RECORD_ID
Initialize a COUNTER variable
Find a list of records matching RECORD_IDS (Get "DB_RECORD_ID")
Map the fields
Increment the COUNTER variable
Update the table using COUNTER constrained by DB_RECORD_ID
The job uses a tAccessInput "Get Rec Ids" to start the processing. tAccessInput goes into a tFlowToIterate. For each iteration, the next five components are called.
Lookup Rec is a second tAccessInput that runs a query returning the distinct DB_RECORD_IDs. It uses row1.record_id as a parameter. Here is the SQL with the parameter.
The Update component is a tAccessOutput which has "Action on data" set to Update. The schema is a subset of UpdateTestSourceTable: only db_record_id (key) and seq_row.
For learning about how to set up a counter in a Talend job, read Running Count in Talend Open Studio. "Set Sec Num" is set to 1 at the stat of each RECORD_ID iteration; this sets the variable "counter". The code behind the tJavaRow increments the counter and maps each input field to the output.
output_row.db_record_id = input_row.db_record_id;
output_row.sec_row = input_row.sec_row;
int counter = ((Integer)globalMap.get("counter"));
globalMap.put("counter", ++counter);
For a database like Oracle RAC, nothing beats a set-based operation like MERGE. However, it may be too difficult if not impossible to render a particular requirement in SQL. In these cases, use the procedural facilities (tLoop, etc.) in Talend Open Studio to better interpret the algorithm. Watch for performance! If you do too much work in a single iteration, your job might not scale.
Additional Screen Shots
The following screen shots are for (in order) the configuration of "Get Rec Ids", "Lookup Rec", and "Update".
See the above screenshot "SQL Parameterized" for the embedded SQL statement in "Lookup Rec".
The following is the tMap used in the job.
Excel Input (not Database) Walkthrough
In most cases, it's best to get the data into an RDBMS as quickly as possible and leave the flat files -- including Excel files -- saved for historical records. The RDBMS gives the ultimate in flexibility and performance. However, it is possible to do complex processing using Talend components like tAggregateRow or tFilterRow in place of SQL.
For a walkthrough of a job that uses only Excel (not a relational database) for an input source, watch the following video.
http://www.youtube.com/watch?feature=player_embedded&v=8_XtGcQo7Xw#t=0s
Excel Output
This screenshow shows a job using a tExcelFileOutput. It was created using 4.2.1. The older format (excel2007 unchecked) was also tested. The Component View is showing the configuration for the tExcelFileOutput component.
SQL Update or Delete
To execute a single SQL UPDATE or DELETE statement, use the t*Row components in the Databases family: tOracleRow, tPostgresqlRow, tMysqlRow. In the Query box in "Basic settings", enter a SQL string joined with variables taken from the globalMap of the job.
The t*Row components can be called as standalone subjobs, in an iteration, or as the target of a flow.
Procedural Coding
When it isn't practical to render a data transformation in SQL, use a procedure to more cleanly express the algorithm. In Talend Open Studio, this is implemented using components like tLoop and tFlowToIterate. These component break a data flow up -- say from an Access datasource -- into distinct iterations. Each iteration can drive additional processing like keeping a counter of the records processed in a current iteration.
In this post, a table "UpdateTestSourceTable" is loaded with 5 records. Each record contains a RECORD_ID which is a business record key that is not unique. An auto-increment DB_RECORD_ID is used for uniqueness. SEC_ROW is a column that defines ordering within a RECORD_ID. Initially, this is empty. The job in this post will set the value as in the screenshot.
Data Used in Example |
This job implements the following procedural algorithm. RECORD_ID is a business record key with meaning, but is not unique in the table. DB_RECORD_ID is a surrogate key that guarantees uniqueness.
Get a list of distinct RECORD_IDS from a table
For each RECORD_ID
Initialize a COUNTER variable
Find a list of records matching RECORD_IDS (Get "DB_RECORD_ID")
Map the fields
Increment the COUNTER variable
Update the table using COUNTER constrained by DB_RECORD_ID
The job uses a tAccessInput "Get Rec Ids" to start the processing. tAccessInput goes into a tFlowToIterate. For each iteration, the next five components are called.
Job Loading Records by Iteration |
Lookup Rec is a second tAccessInput that runs a query returning the distinct DB_RECORD_IDs. It uses row1.record_id as a parameter. Here is the SQL with the parameter.
SQL Parameterized with a tFlowToIterate's Variable |
For learning about how to set up a counter in a Talend job, read Running Count in Talend Open Studio. "Set Sec Num" is set to 1 at the stat of each RECORD_ID iteration; this sets the variable "counter". The code behind the tJavaRow increments the counter and maps each input field to the output.
output_row.db_record_id = input_row.db_record_id;
output_row.sec_row = input_row.sec_row;
int counter = ((Integer)globalMap.get("counter"));
globalMap.put("counter", ++counter);
For a database like Oracle RAC, nothing beats a set-based operation like MERGE. However, it may be too difficult if not impossible to render a particular requirement in SQL. In these cases, use the procedural facilities (tLoop, etc.) in Talend Open Studio to better interpret the algorithm. Watch for performance! If you do too much work in a single iteration, your job might not scale.
Additional Screen Shots
The following screen shots are for (in order) the configuration of "Get Rec Ids", "Lookup Rec", and "Update".
tAccessInput "Get Rec Ids" Config |
tAccessInput "Lookup Rec" Config |
tAccessOutput "Update" Config |
tMap Config |
In most cases, it's best to get the data into an RDBMS as quickly as possible and leave the flat files -- including Excel files -- saved for historical records. The RDBMS gives the ultimate in flexibility and performance. However, it is possible to do complex processing using Talend components like tAggregateRow or tFilterRow in place of SQL.
For a walkthrough of a job that uses only Excel (not a relational database) for an input source, watch the following video.
Excel Output
This screenshow shows a job using a tExcelFileOutput. It was created using 4.2.1. The older format (excel2007 unchecked) was also tested. The Component View is showing the configuration for the tExcelFileOutput component.
Job Using a tExcelFileOutput |
No comments:
Post a Comment