Tuesday, June 11, 2013

Controlling Row Output in Talend Open Studio Components


Most of the Talend Open Studio components I've read about support a one-to-one correspondence between input and output.  That is, for each input row, a single output row is written to an outgoing connection.  However, components like tSplitRow will write more than one output row to a connection.

A main flow drives flow-based (versus iterate-based) processing in Talend Open Studio.  For most components, an input record will produce zero or more output records.  An input row to a tReplace produces an output row with fields possibly transformed.  An input row to a tFilterRow produces a single output row that will be routed to either a Filter or a Reject connection.

tScriptRules

tScriptRules is a custom component on the Talend Exchange.  Its functionality is similar to tFilterRow, but it includes error code and message information along with the rejects.  tScriptRules also accepts JavaScript-like syntax in its filtering expressions.

A recent enhancement to tScriptRules added a Run All mode.  The normal mode of operation for tScriptRules is to stop after the first failure and immediately produce a record for a Reject connection.  If there are no failures, a record is produced for the Filter connection.  Either way, processing continues with the next record.

The Run All mode will execute all of the rules, not breaking after the first failure. In this case, more than one output record may be written to the Reject connection.
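The difference between the two modes can be sketched in plain Java.  This is only an illustration of the idea, not the tScriptRules source: the Rule class, the rule names, and the sample row are all made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class RunAllSketch {

    // Hypothetical rule: a name plus a test that returns true when the row passes.
    static final class Rule {
        final String name;
        final Predicate<String> test;
        Rule(String name, Predicate<String> test) {
            this.name = name;
            this.test = test;
        }
    }

    // Made-up rules, used only for this illustration.
    static List<Rule> sampleRules() {
        List<Rule> rules = new ArrayList<>();
        rules.add(new Rule("notEmpty", s -> !s.isEmpty()));
        rules.add(new Rule("maxLength5", s -> s.length() <= 5));
        rules.add(new Rule("noDigits", s -> s.chars().noneMatch(Character::isDigit)));
        return rules;
    }

    // Returns the names of the rules that failed for one input row.
    // With runAll == false, evaluation stops at the first failure.
    static List<String> failures(String row, List<Rule> rules, boolean runAll) {
        List<String> failed = new ArrayList<>();
        for (Rule rule : rules) {
            if (!rule.test.test(row)) {
                failed.add(rule.name);
                if (!runAll) {
                    break; // normal mode: at most one reject per input row
                }
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        System.out.println(failures("abc12345", sampleRules(), false)); // normal mode
        System.out.println(failures("abc12345", sampleRules(), true));  // Run All mode
    }
}
```

In normal mode the sample row yields one failure; in Run All mode it yields two, which is why the Reject connection can receive more than one record per input row.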

Main Java JETs

Talend Open Studio custom components are built using Java Emitter Templates (JET).  JET is a templating mechanism, like JavaServer Pages, in which a basic code structure is output with values substituted when the code is generated.  The Talend Wiki describes creating custom components with JET.

Code Generation

The following screenshot shows a job with three sample components using two connectors.  The sample components "tInputComponent" and "tOutputComponent" aren't found in the Palette or on the Exchange.  Each component is implemented using a _main.javajet.  There are files tInputComponent_main.javajet, tComponent_main.javajet, and tOutputComponent_main.javajet.

Sample Job

This pseudocode (not Java code) shows conceptually how Talend Open Studio will generate a runnable job.  This pseudocode is for the normal (not Run All) mode of operation.

FOREACH row1 IN tInputComponent

  DO tInputComponent processing

  DO tScriptRules processing

  DO tOutputComponent processing

END FOREACH row1

Each component's _main.javajet precedes the downstream _main.javajets, ending with an output component.  Talend Open Studio controls the arrangement of the _main.javajets, creating the looping structures, including the closing of the loops.  Each "DO" processing step usually examines the input row (row1) and produces a transformed output row (row2).

Controlling the Loop

To support the Run All mode of tScriptRules, I introduced a second loop.  The inner loop considers each rule in the tScriptRules processing, possibly invoking the tOutputComponent record-writing processing more than once.

FOREACH row1 IN tInputComponent

  DO tInputComponent processing

  DO tScriptRules processing

  FOREACH failure from the tScriptRules processing

    DO tOutputComponent processing

  END FOREACH failure

END FOREACH row1

Instead of relying on Talend Open Studio to link together the three components in the generated code, tScriptRules opens up a second loop ("FOREACH failure").  Talend Open Studio then inserts the tOutputComponent processing.
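To make the nesting concrete, here is roughly what the pseudocode looks like as ordinary Java.  It is a hand-written sketch, not actual generated code: the Reject class, the check method, and its two rules are stand-ins I made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

public class NestedLoopSketch {

    // Hypothetical reject record: the original value plus the failed rule's message.
    static final class Reject {
        final String value;
        final String message;
        Reject(String value, String message) {
            this.value = value;
            this.message = message;
        }
        @Override
        public String toString() {
            return value + ": " + message;
        }
    }

    // Stand-in for the rules step in Run All mode: collect every failure for one row.
    static List<Reject> check(String value) {
        List<Reject> saved = new ArrayList<>();
        if (value.length() > 5) {
            saved.add(new Reject(value, "too long"));
        }
        if (value.chars().anyMatch(Character::isDigit)) {
            saved.add(new Reject(value, "contains digit"));
        }
        return saved;
    }

    static List<String> run(List<String> input) {
        List<String> written = new ArrayList<>();
        for (String row1 : input) {                // outer loop, opened by the input component
            List<Reject> savedRows = check(row1);  // rules step
            for (Reject row2 : savedRows) {        // inner loop, opened by the rules component
                written.add(row2.toString());      // output step, placed here by the generator
            }                                      // inner loop has to close after the output step
        }
        return written;
    }

    public static void main(String[] args) {
        List<String> input = new ArrayList<>();
        input.add("ok");
        input.add("abc12345");
        input.add("toolong");
        run(input).forEach(System.out::println);
    }
}
```

The row "abc12345" fails both made-up rules, so the output step runs twice for that single input row.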

This is the snippet of code in tScriptRules that sets up the loop to write out to the Reject connection.

for( Object row_<%=cid%> : savedRows_<%= cid %> ) { // R_02

   <% if( rejectRowName != null ) { %>
   if( hasFailures_<%= cid %> )
    <%= rejectRowName%> = (<%= rejectRowName %>Struct)row_<%=cid%>;


But there is a problem.

BlockCode

In designing custom components, there should be no coupling of one component to another.  This contract is essential so that TOS components can be mixed and matched by developers in any allowed combination.  The problem with the preceding snippet of tScriptRules occurs during code generation.  A block of code is opened (the loop) and Talend inserts the processing of the downstream components, but the loop is never closed.  The generated Java source file is malformed and fails to compile.

The fix is to use a Talend API call to record the opened loop, which later prompts Talend to close it after all of the downstream components, including the output component, have been assembled.  This is taken from tScriptRules.

<%
List blockCodes = new java.util.ArrayList(1);
blockCodes.add(new BlockCode("R_02"));
blockCodes.add(new BlockCode("R_01"));
((org.talend.core.model.process.AbstractNode) node).setBlocksCodeToClose(blockCodes);
}
%>


The "blockCodes.add" with the R_02 argument will indicate to Talend that a code block needs to be closed.  The R_02 will be realized in a comment and is useful for debugging.  The convention is to mark the opening loop with the same marker so that the generated Java source can be matched.

tScriptRules also wraps the inner loop in conditional logic.  This is the R_01 code block.  An if statement is opened up that will also need to be closed.  (If there is no input, skip the inner loop.)  When you're writing your own loops to use in custom components, be sure to handle all of the connection cases, especially where a connection is missing.
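The effect of recording the open blocks can be shown with a toy code generator.  This is a simplified model of my own and not Talend's generator: the Component class, the generate method, and the assumption that blocks are closed in list order after all downstream code are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;

public class BlockCloseSketch {

    // Toy component: the code its main part emits, plus labels of blocks it leaves open.
    static final class Component {
        final String mainCode;
        final List<String> blocksToClose;
        Component(String mainCode, List<String> blocksToClose) {
            this.mainCode = mainCode;
            this.blocksToClose = blocksToClose;
        }
    }

    // Emits each component's main code in flow order, then closes the recorded
    // blocks of each component, last component first, after all downstream code.
    static String generate(List<Component> flow) {
        StringBuilder out = new StringBuilder();
        for (Component c : flow) {
            out.append(c.mainCode).append('\n');
        }
        for (int i = flow.size() - 1; i >= 0; i--) {
            for (String label : flow.get(i).blocksToClose) {
                out.append("} // ").append(label).append('\n');
            }
        }
        return out.toString();
    }

    // Counts '{' minus '}'; zero means the braces balance.
    static int braceBalance(String code) {
        int depth = 0;
        for (char ch : code.toCharArray()) {
            if (ch == '{') depth++;
            if (ch == '}') depth--;
        }
        return depth;
    }

    // Builds a two-component flow; recordBlocks toggles whether the open blocks are recorded.
    static String sample(boolean recordBlocks) {
        List<String> labels = new ArrayList<>();
        if (recordBlocks) {
            labels.add("R_02"); // inner for loop, closed first
            labels.add("R_01"); // outer if statement, closed last
        }
        List<Component> flow = new ArrayList<>();
        flow.add(new Component(
            "if (hasInput) { // R_01\nfor (Object row : savedRows) { // R_02", labels));
        flow.add(new Component("write(row);", new ArrayList<>()));
        return generate(flow);
    }

    public static void main(String[] args) {
        System.out.println(sample(true));
        System.out.println("balance with blocks recorded: " + braceBalance(sample(true)));
        System.out.println("balance without: " + braceBalance(sample(false)));
    }
}
```

With the labels recorded, the closing braces land after the output code and the braces balance.  Without them, two blocks stay open, which is the malformed-source problem described above.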

For additional resources on this topic, download tScriptRules from the Exchange or examine tSplitRow in your local TOS installation.  Be sure to mark your opened loops or if statements with a comment.  It can be difficult to write Talend javajet code because of the similarity between the templating (Java) and the generated product (also Java).
