Tuesday, June 11, 2013

Updating a Table with the Talend Open Studio ELT Components


ELT stands for Extract-Load-Transform and is an inversion of the data processing pattern "ETL".  In ELT, data is manipulated within a database which is a huge performance improvement.  Talend Open Studio comes with components like tELTMySqlMap to make the writing of such manipulations easier.

When updating data in a database, sometimes its easier to load a file in a staging table and then manipulate the data within the database.  It may even be the only option based on performance requirements.  Using a staging table means loading relatively unclean data from a source like a spreadsheet or CSV file into a schema based heavily on the input.  Subsequent transformations and loads within the database will address the quality and move the data into a more usable format.

The Talend Open Studio (TOS) ELT components (for example, tELTMySqlMap and tELTOracle) will generate SQL INSERT OR UPDATE statements based on Talend metadata.  While these statements can be formed by hand, forming the SQL with Talend will cut down on typing and errors.  It's especially useful on a consulting job where large sets of unfamiliar columns need to be mapped.

 An ELT Job

This TOS job will update a table CONTACT.  CONTACT has a column BUSINESS_ID which will be updated based on a mapping established in the BUSINESS_CONTACT table.  The BUSINESS_CONTACT table uses a business key "BUSINESS_NAME" to retrieve the value BUSINESS_ID.

ELT Job
 BUSINESS and BUSINESS CONTACT are input tables to the tELTMySqlMap component.  The output table is CONTACT.  The schema defined for the CONTACT table in the tMySqlOutput component is a subset of the fields because this job will only be updating a single field.  For inserts, the whole set of fields would be used.

ELT Map

In the map component, the input tables are joined on a business key "BUSINESS_NAME".  From the BUSINESS table, BUSINESS_ID is mapped to the output.  If this were an insert operation, the job is complete.  Because this is an update, there is an additional step.

In an update, the input needs to be correlated with the output. This is accomplished using an alias 'C' on the target table.  The C alias is imposed as a condition on the join table using an expression entered by hand.  Notice that the target table is not added as a third input.  That's because the UPDATE statement generated will add input tables to the FROM cause, not what we want with an "UPDATE CONTACT" statement where the table name follows the UPDATE.

To set up the C alias, change the table name to include an alias in the tELTMySqlOutput component.

Adding an Alias to tELTMySqlOutput
 The result of a run generates the following SQL.

An Update Statement Considering the Output Table
Troubleshooting

If there is an error in the SQL generated by the Talend component, check the list of columns in the ETL Map.  Although this example only updates a single column, I have used an ELT component -- tMSSqlMap specifically -- to update more than one column.  What you're looking for in the list of columns are syntactic differences between a plain column (like 'dateCol') and a function or vendor-specific construct (like 'GETDATE()').  Try removing out anything that isn't a plain column when debugging.


 Sometimes the only way to process a large amount of data is to first stage the data in a loosely-defined and unconstrained table and to manipulate it once the data is in the database.  This is ELT (Extract-Load-Transform), an inversion of the more widely known "ETL".  Using Talend Open Studio's ELT helps generate the SQL for this type of transformation.  These SQL statements can certainly be created by hand, but if you're a consultant or dealing with an unfamiliar schema, the productivity boost of mapping unfamiliar tools with a GUI can be substantial.

No comments: