A reader asked how to extract the bond type "CD" from the following input string: "CD Corporation du 20/12/2010 4.5% à 26 semaines". Although it's easy to grab the first two characters in a tMap using a substring function, there is an off-the-shelf component tExtractRegexFields that can handle varying lengths.
This job uses a tFixedFlowInput to provide data. The tFixedFlowInput is run into a tExtractRegexFields which breaks the input into two strings: investment type and remainder. The tExtractRegexFields is connected to a tMap which filters the columns. The result is output to tLogRow.
|
Input Data for Regex-parsing Job |
The test data consists of four records representing four French investment types: CD, OAT, BTF, BTAN. Types vary in character length: 2, 3, 4. While a simple substring() in a tMap is a quick way to pluck the first two characters off of a string, that solution won't work for the varying characters. Regular expressions can be used to handle the variety. A regular expression keys off of the repeating structure of the input data rather than the fixed positions.
|
tExtractRegexFields |
Regular expression syntax can be intimidating, and is best learned by breaking down examples. Fortunately, nearly every toolkit from Javascript to Perl has an easy way to work with regular expression. In the Regex supporting this example, the investment type (CD, OAT, etc) is identified as a sequence of word characters (A-Z, a-z, 0-9). The regular expression will match the characters until it encounters one not in the set (in this case, the space character). The following ".*" will grab everything at the end.
The parenthesis characters denote the column match groups. Everything in the first set of parens -- the investment type w+ -- is the first group. The remainder is the second group. Here is the outbound schema used in the tExtractRegexFields component.
|
Schema Used by tExtractRegexFields |
The tMap isn't essential. I'm using it to filter the remainder column.
|
A tMap Filtering a Column |
Finally, here is the output generated when the job is run. tLogRow provides the writing.
|
Input String Parsed to Define an Investment Type |
Regular expressions are enormously power. Thankfully, they're available in most languages. The best way to learn the regular expressions is to break down examples using an interpreter like Perl or a Javascript program. Although many regular expression implementations will be the same in different environments, there is sometimes a syntactic different between a Perl and a Java regular expression. The most notable is the need to escape the important backslash ('\') character with a second backslash to form valid Java strings.
No comments:
Post a Comment