Tuesday, June 18, 2013

The Regex Lookup Table with Talend Open Studio


 In Talend Open Studio, most matching and joining will be done using well-defined keys like integers (surrogate keys) or business keys (username) using equality.  Sometimes, it's useful to match on regular expressions.  Instead of embedding regular expressions in your tMaps, consider using a more general solution based on java.util.HashMap.

When mapping in Talend Open Studio, you may want to use regular expressions to match a lookup table.  You can apply a regular expression in a tMap, but often there is a need to apply several expressions.  This can lead to a cumbersome structure that isn't easy to reuse in a Job.

For example, the following code shows an ever-increasing tangle of logic that becomes embedded in a component like tMap.

// accruing more and more logic / subject to change
if( input_row.SYSTEM_FUND_SOURCE_NM.matches("Fred.*") ) {
   output_row.VALUE = "LOCAL_GOVERNMENT";
} if( input_row.SYSTEM_FUND_SOURCE_NM.matches("Washington") ) {
   output_row.VALUE = "LOCAL_GOVERNMENT";
} else if (input_row.SYSTEM_FUND_SOURCE_NM.matches("[Ff][Ee][Dd].*") ) {
   output_row.VALUE = "FEDERAL_GOVERNMENT";
}
// ... etc.


It's easier to maintain this type of mapping in a database table or spreadsheet like this.

Lookup Table of Regular Expression Mappings

The typical tMap or tJoin component works well if you want to use a single value with the equality operator.  However, there's no way to compare a regular expression.  You can't switch the "=" out for a "matches()".

Using custom Java code, you can set up a Java data structure called a Map (implemented as a HashMap).  The key of the Map will be a regular expression and the value will be the looked-up value.  The Map will be loaded with the contents of the database table or spreadsheet.  As the source data is read in, more custom Java code will be used to examine the Map and find the appropriate value.

Regex Lookup Job
Regex Lookup Job contains three subjobs.  The first subjob has a single tJava component that creates the Java Map and imports Java libraries.  The second subjob reads an Excel spreadsheet into the Map using a tFileInputExcel and tJavaRow component.  Lastly, a third subjob reads the data -- from an Excel spreadhseet -- and merges it with the regex-powered lookups with another tJavaRow; the data is then printed with tLogRow.

Subjob 1 - Initialize

The following code is added to the Basic and Advanced Settings tab of a tJava component.  The first block creates a Java data structure called a Map using Java generics.  The Advanced Settings include the required imports.

// Basic settings
Map regexMap =
new HashMap();

globalMap.put("regexMap", regexMap);

// Advanced settings
import java.util.Map;
import java.util.HashMap;


Subjob 2 Load Lookup Table

Subjob 2 uses a tFileInputExcel and a tJavaRow to read the Excel spreadsheet into the Java Map.  The tFileInputExcel is the lookup spreadsheet shown at the start of this post and its schema is identical to what is shown.

The tJavaRow component will put a FUND_SOURCE_NM in the Map for each of the regular expressions.  The regular expression will serve as a key.


Map regexMap =
(Map)globalMap.get("regexMap");
 if( row1.REGEXP != null ) {
  regexMap.put( row1.REGEXP, row1.FUND_SOURCE_NM );
}


Subjob 3 Load the Data

The last subjob will take the data and look up the data using the Map created in the first two subjobs.  The subjob starts with a tFileInputExcel, which is the source data.  That feeds into a tJavaRow which will map all of its fields to the output using the Generate Code feature.  An additional field is defined that will contain the value pulled from the Map.

This is the schema behind the tJavaRow component in the third subjob.

tJavaRow Schema with Extra Value Column
 The code behind the tJavaRow will use the generated mapping of input to output combined with the extra column which will hold the looked up value, the result of matching a regular expression.  This functionality will select the first match; this could be adjusted for a more complicated or variable algorithm.

output_row.SYSTEM_FUND_SOURCE_NM = input_row.SYSTEM_FUND_SOURCE_NM;
output_row.FUND_NM = input_row.FUND_NM;
output_row.RESTRICTED = input_row.RESTRICTED;
output_row.REVENUES = input_row.REVENUES;

output_row.VALUE = "";

Map regexMap = (Map)globalMap.get("regexMap");
for( String key : regexMap.keySet() ) {
 if( input_row.SYSTEM_FUND_SOURCE_NM != null ) {
  if( input_row.SYSTEM_FUND_SOURCE_NM.matches(key) ) {
   output_row.VALUE = regexMap.get(key);
   break;
  }
 }
}


Here's the output from running the job.

Run of Regex Lookup Table Job
If you have a simple String to match in Talend Open Studio, embed a regular expression in components like tMap.  However, as the matching becomes more complicated -- particularly if a long list of matches starts to form -- consider maintaining a table of expressions and lookup values.  This example uses a few simple subjobs to set up a tables that can be maintained separate from the processing.

No comments: