Thursday, June 13, 2013

Scanning an Input String in Talend Open Studio


Use Java's regular expressions packaged in a Talend Open Studio Routine to scan an input string and perform an advanced manipulation.

Java's regular expressions are powerful and can handle string manipulation that exceeds the capabilities of split() or replace().  A useful pattern for this type of manipulation is to build a new string from the matches that a regular expression finds in the input.

The regular expression should be packaged as a Talend Routine, which makes it available in components like tMap.

For example, consider the following input as expressed in the variables n1, n2, and n3.  This Java code could be embedded in a tJava component.

String n1 = "Sec1&lib1$$Sec2&lib2";
String n2 = "Sec2&lib2$$Sec4&lib4$$Sec6&lib6";
String n3 = "Sec1&lib1$$Sec2&lib2$$Sec3&lib3$$Sec5&lib$$Sec8&lib8$$Sec9&lib9";
 

System.out.println("n1=" + FilterUtils.filterNonSec(n1));
System.out.println("n2=" + FilterUtils.filterNonSec(n2));
System.out.println("n3=" + FilterUtils.filterNonSec(n3));

filterNonSec needs to remove the non-Sec parameters and keep the Sec parameters.  In this input the non-Sec parameters are the "lib" parameters, but the regular expression solution will handle other parameters as well.  First define a Sec parameter as a regular expression of the form Sec[0-9]+ where "Sec" is followed by one or more digits.  Any non-digit character following the digits ends the match, so it serves as the boundary.
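To see how Sec[0-9]+ finds its boundaries, here is a minimal sketch that lists the matches for a few inputs.  The class name SecPatternDemo, the helper findSecTokens, and the second and third sample strings are made up for illustration and are not part of the Routine.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the FilterUtils Routine
class SecPatternDemo {

    // "Sec" followed by one or more digits
    private static final Pattern SEC = Pattern.compile("Sec[0-9]+");

    // Collects every substring of the input that matches Sec[0-9]+, in order
    static List<String> findSecTokens(String input) {
        List<String> tokens = new ArrayList<String>();
        Matcher m = SEC.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // '&' and '$' are not digits, so each match stops there
        System.out.println(findSecTokens("Sec1&lib1$$Sec2&lib2")); // [Sec1, Sec2]
        // Any other non-digit works as a boundary too
        System.out.println(findSecTokens("Sec12;other=5|Sec3"));   // [Sec12, Sec3]
        // "Sec" with no digit after it is not a match
        System.out.println(findSecTokens("Sec&lib1"));             // []
    }
}
```

Note that the pattern is not anchored at the start of a token, so a string such as "xSec7" would still produce the match Sec7.  If that matters for your data, the pattern would need a stricter prefix.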

Expected Output

The expected output keeps only the Sec tokens, joined with the ampersand separator.

n1=Sec1&Sec2
n2=Sec2&Sec4&Sec6
n3=Sec1&Sec2&Sec3&Sec5&Sec8&Sec9

 
Code
 
Call the Matcher method find() repeatedly to build up a string.  Each successful find() locates the next substring matching Sec[0-9]+, and group() returns the particular Sec being examined.  For each matching Sec, the Sec token (including the number) is appended to a StringBuffer along with a separator.

Note the Java technique for appending the separator.  The separator is appended before each token except the first.  On the first pass the append is skipped and the firstPass flag is cleared.

Here is a Talend Routine that can be packaged as "FilterUtils".  Create a Routine "FilterUtils", then swap out the sample static method for this code.


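public static String filterNonSec(String _input) {

   // Nothing to scan: return an empty string rather than null
   if( _input == null || _input.length()==0 ) {
     return "";
   }

   StringBuffer output_sb = new StringBuffer("");

   // Match "Sec" followed by one or more digits
   java.util.regex.Pattern p =
       java.util.regex.Pattern.compile("Sec[0-9]+");

   java.util.regex.Matcher m = p.matcher(_input);

   boolean firstPass = true;
   while( m.find() ) {

     // Prepend the separator to every token except the first
     if( !firstPass )
       output_sb.append("&");
     else
       firstPass = false;

     // group() returns the text matched by the most recent find()
     output_sb.append(m.group());
   }

   return output_sb.toString();
}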
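
As an alternative to the firstPass flag, the separator bookkeeping can be handed to java.util.StringJoiner.  This is only a sketch and assumes the job runs on Java 8 or later, which may not be true of every Talend Open Studio installation.  The class name JoinerSketch and the method joinSecTokens are made up for illustration.

```java
import java.util.StringJoiner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical alternative to the flag-based loop; assumes Java 8 or later
class JoinerSketch {

    static String joinSecTokens(String input) {
        if (input == null || input.isEmpty()) {
            return "";
        }

        // StringJoiner inserts "&" between elements, never before the first
        StringJoiner joined = new StringJoiner("&");

        Matcher m = Pattern.compile("Sec[0-9]+").matcher(input);
        while (m.find()) {
            joined.add(m.group());
        }

        // With no elements added, this returns an empty string
        return joined.toString();
    }

    public static void main(String[] args) {
        System.out.println(joinSecTokens("Sec2&lib2$$Sec4&lib4$$Sec6&lib6")); // Sec2&Sec4&Sec6
    }
}
```

The flag-based version works on older Java versions, so it is the safer choice if you are unsure which JVM your jobs will run on.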


Regular expressions are a powerful way to handle a complex input string.  In Talend, use a Routine -- rather than chunks of Java code embedded in components -- to package the expressions.  If you're writing a lot of regular expression code, consider a test-driven methodology so that all of the variations of input (empty strings, nulls, etc.) can be covered.
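
As a rough illustration of that test-driven idea, here is a small table-driven check written in plain Java.  It carries its own copy of filterNonSec so that it runs outside Talend.  The class name FilterUtilsCheck, the method runChecks, and the "lib1&lib2" case are made up for this sketch, and in a real project a test framework such as JUnit would probably be a better fit.

```java
// Hypothetical stand-alone check; not part of the FilterUtils Routine
class FilterUtilsCheck {

    // Copy of the filterNonSec routine from this post, so the sketch runs on its own
    static String filterNonSec(String input) {
        if (input == null || input.length() == 0) {
            return "";
        }
        StringBuffer out = new StringBuffer();
        java.util.regex.Matcher m =
            java.util.regex.Pattern.compile("Sec[0-9]+").matcher(input);
        boolean firstPass = true;
        while (m.find()) {
            if (!firstPass) {
                out.append("&");
            } else {
                firstPass = false;
            }
            out.append(m.group());
        }
        return out.toString();
    }

    // Runs every case and returns the number of failures; each row is {input, expected}
    static int runChecks() {
        String[][] cases = {
            { null, "" },                                            // null input
            { "", "" },                                              // empty string
            { "lib1&lib2", "" },                                     // no Sec tokens at all
            { "Sec1&lib1$$Sec2&lib2", "Sec1&Sec2" },                 // n1 from the post
            { "Sec2&lib2$$Sec4&lib4$$Sec6&lib6", "Sec2&Sec4&Sec6" }  // n2 from the post
        };
        int failures = 0;
        for (String[] c : cases) {
            String actual = filterNonSec(c[0]);
            if (!c[1].equals(actual)) {
                failures++;
                System.out.println("FAIL input=" + c[0]
                    + " expected=" + c[1] + " actual=" + actual);
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        int failures = runChecks();
        System.out.println(failures == 0
            ? "all checks passed"
            : failures + " check(s) failed");
    }
}
```

Adding a new input variation is then a one-line change to the cases table.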
