Tuesday, June 11, 2013

tScriptRulesLoad for Talend Open Studio

tScriptRules is a third-party Talend Open Studio component that applies a set of Javascript rules to an input flow.  The rules for tScriptRules can be defined in the component itself via a table in the Component View.  Alternatively, the tScriptRulesLoad component can be used to retrieve the rules from any input source available to Talend.


Background

tScriptRules was written as a more flexible filtering mechanism than the off-the-shelf components tFilterRow and tMap.  With tScriptRules, you can define filtering rules using a Javascript-like syntax.  Additionally, tScriptRules adds error reporting columns to a reject flow, which can be used to produce a data quality report or troubleshoot an error.


The rules in a tScriptRules component can be loaded from table elements set up in the component's Component View.  However, when this technique is used, the rules become embedded in the job.  This makes maintenance more difficult (every rule update means changing Talend code), and the rules cannot be reused, either elsewhere in the job or in other jobs.

tScriptRulesLoad

tScriptRulesLoad separates the rules from the Talend job and the component.  This means that the rules can be maintained independently, say as a text file in a source code control system.  The same rules can be applied in multiple jobs.  The same rules can be applied to multiple tScriptRules components in the same job.  In an advanced use case, you can even use the rules loaded from tScriptRulesLoad independently of any tScriptRules components.


Sample Job

In this job, rules are loaded from a text file using a tFileInputDelimited.  The source is a text file 'rules_success.txt'.  tScriptRulesLoad stores the loaded rules in an internal data structure.  The tScriptRules component refers to this data structure through the "Rule list from" setting in its Component View.

Basic Job Using tScriptRules with tScriptRulesLoad
The tScriptRules in the screenshot takes an input flow from a tMySqlInput component.  For each input record (row2), tScriptRules applies a rule loaded from the tScriptRulesLoad subjob.  If the expression evaluates to true, the record is routed to the filter flow (a tMySqlOutput target).  Otherwise, the record is routed to a reject flow with an error code ("1") and message ("field1 ok").
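
The routing described above can be sketched in plain Java.  This is not the component's implementation: a Predicate stands in for the Javascript-like rule expression, the class and method names are hypothetical, and the reject column names (reasonCode, reasonMessage) are assumed to match the rule columns.  It only illustrates "true goes to filter, anything else goes to reject with the rule's code and message".

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical sketch of filter/reject routing; not the tScriptRules source code.
class RoutingSketch {

    // One loaded rule. The Predicate replaces the real rule expression (assumption).
    static final class Rule {
        final Predicate<Map<String, Object>> expression;
        final String reasonCode;
        final String reasonMessage;

        Rule(Predicate<Map<String, Object>> expression, String reasonCode, String reasonMessage) {
            this.expression = expression;
            this.reasonCode = reasonCode;
            this.reasonMessage = reasonMessage;
        }
    }

    // Holds the two output flows.
    static final class Result {
        final List<Map<String, Object>> filter = new ArrayList<>();
        final List<Map<String, Object>> reject = new ArrayList<>();
    }

    // Sends each row to the filter flow if the rule is true, otherwise to the
    // reject flow with the rule's code and message appended as extra columns.
    static Result route(List<Map<String, Object>> rows, Rule rule) {
        Result result = new Result();
        for (Map<String, Object> row : rows) {
            if (rule.expression.test(row)) {
                result.filter.add(row);
            } else {
                Map<String, Object> rejected = new LinkedHashMap<>(row);
                rejected.put("reasonCode", rule.reasonCode);      // assumed column name
                rejected.put("reasonMessage", rule.reasonMessage); // assumed column name
                result.reject.add(rejected);
            }
        }
        return result;
    }
}
```

With a rule equivalent to "field1 equals 'ok'", a record with field1=ok lands in the filter list, and a record with field1=fail lands in the reject list carrying code "1" and message "field1 ok".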

Usage

To use a tScriptRulesLoad component, drag an input source and a tScriptRulesLoad component onto the canvas.  In this case, a tFixedFlowInput is the input source.


Step 1: Add Components to Canvas
Next, connect the tFixedFlowInput_1 Main to the tScriptRulesLoad_1 component.  The tFixedFlowInput_1 component will still show errors as a schema has not been established yet.  (The values in the tFixedFlowInput component have not been added either.)

On the tScriptRulesLoad Component View, click the Edit Schema button.  The following dialog is displayed.

Edit Schema Dialog
Press the double arrow icon in the middle of the dialog to copy the tScriptRulesLoad columns to the tFixedFlowInput component.  A warning dialog "All columns from the output schema..." will be displayed.  Press "ok".  The result is the following.

Columns Copied to Input Source
Finally, define a rule in the tFixedFlowInput.

A Rule Defined in a tFixedFlowInput
If only one tScriptRules component is used, this job can be streamlined to use the embedded table to hold the rule.  However, if more than one tScriptRules component is used, then the rules defined in the tScriptRulesLoad can be applied across both components.  See the following job.  tFixedFlowInput_2 is a single record with "field1=ok".  tFixedFlowInput_3 is a single record with "field1=fail".

Rules Reused in Two tScriptRules

Rule Definition

Note the use of 'input_row' in the rule definition.  The tScriptRules components will accept this alias for the row name.  You can use the actual row name (row2, row3), but this limits the rule's usage if the actual row name changes or isn't applicable for a particular flow.


A rule must evaluate to true for a record to be routed to the filter connection.  If a rule is misapplied (say, a rule "row2 == 'ok'" in tScriptRules_2), every input record will be rejected by tScriptRules_2.

Adaptations

This application of tScriptRulesLoad used a tFixedFlowInput.  Any Talend input can serve as an input source for tScriptRulesLoad.  This includes a text file, database, XML document, or even a web service.


If the schema set in the input source differs from what tScriptRulesLoad expects (take a text file with columns "rule,code,msg" rather than "jexlExpression,reasonCode,reasonMessage"), use a tMap.  Follow the procedure from the tFixedFlowInput example, with the tMap placed between the input source and tScriptRulesLoad.  Then map the input source fields (rule/code/msg) to the tScriptRulesLoad fields in the tMap (jexlExpression,reasonCode,reasonMessage).
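
As a rough illustration of what that tMap mapping does, here is a plain-Java sketch.  The class and method names are hypothetical, and all three columns are treated as strings, which is an assumption about the schema rather than something the component documents here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the column renaming a tMap would perform.
class SchemaMappingSketch {

    // Copies a source row with columns (rule, code, msg) into a row using the
    // column names tScriptRulesLoad expects. Types are assumed to be strings.
    static Map<String, String> toRuleRow(Map<String, String> source) {
        Map<String, String> target = new LinkedHashMap<>();
        target.put("jexlExpression", source.get("rule"));
        target.put("reasonCode", source.get("code"));
        target.put("reasonMessage", source.get("msg"));
        return target;
    }
}
```

In the actual job, the same one-to-one mapping is done by dragging each input column onto the matching output column in the tMap editor.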

If you'd like to number the rules automatically, use a tMap where the reasonCode is set from a sequence: the Numeric.sequence function.
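
In a tMap expression, the call usually looks something like Numeric.sequence("ruleSeq", 1, 1), where the sequence name "ruleSeq" is a made-up example and the arguments are the name, start value, and step.  The plain-Java sketch below only mimics that behavior to show how each rule row would receive the next number; it is not Talend's routine, and the class name is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in that imitates a named, auto-incrementing sequence.
class SequenceSketch {

    // Next value to hand out for each named sequence.
    private static final Map<String, Integer> SEQUENCES = new HashMap<>();

    // Returns the current value of the named sequence, then advances it by step.
    // The first call for a given name returns the start value.
    static int sequence(String name, int start, int step) {
        int current = SEQUENCES.getOrDefault(name, start);
        SEQUENCES.put(name, current + step);
        return current;
    }
}
```

Calling the function once per rule row with the same name yields 1, 2, 3, and so on, which can then be converted to whatever type the reasonCode column uses.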

Agile Rule Development

One use case I've been experimenting with is defining the rules using a text file input.  The text file can be stored in CM easily, with standard diff commands highlighting differences.  While developing Talend Open Studio jobs, I can bring up a text editor (Textpad) alongside Talend Open Studio.  I can then run and re-run the job, making quick text edits to the input source file.


This setup seems to work well for experimenting in a SQL-like fashion, effectively putting a "where" clause on an input source such as a web server log file.



In my consulting jobs, I deal with a lot of rules.  They start off simply enough ("field x is required"), but can rapidly grow into complex business logic relating multiple fields from multiple sources.  Hopefully, you'll find these components useful.  If you have any product suggestions or bugs, please send them to dev@bekwam.com.
