Tuesday, June 11, 2013

Parsing a URL with Talend Open Studio

Parsing a URL with Talend Open Studio

If you need to process URLs with Talend Open Studio, a few well-placed components can break apart the URL parameters to be stored, converted, or filtered.

Given the following URL

/google.se/url?sa=t&rct=j&q=insights%20konsult&source=web&cd=11&ved=0CC4QFjAAOAo&url=http%3A%2F%2Fwww.inuseinsights.se%2Fom-inuse-insights%2Fpartners

You can break the string apart with 3 Talend Open Studio components that will result in a stream of name/value pairs.  This screenshot shows the running of such a job.

Name / Value Pairs Extracted from a URL
Depending on requirements, additional columns can be carried through the processing to provide a business key (such as host or path).  This screenshot shows the job.

Job Parsing a URL - Two Extra Delimited Fields and a tNormalize
Three components hack off various pieces of the URL.  First, a tExtractDelimitedFields_1 separates the host/path from the QUERY_STRING using the "?" delimiter.  Next, a tNormalize takes each name/value pair, forming a distinct row based on the "&" delimiter.  Finally, the second tExtractDelimitedFields_2 separates the name from the value, based on "=".

The tFilterColumns component is used for presentation purposes, it removes the pre-processed "path" variable.

Here are the component configurations starting with the tFixedFlow component providing the test data.

A tFixedFlowInput with a URL


tExtractDelimitedFields_1
tNormalize 
tExtractDelimitedFields_2
While some custom Java can be thrown into a tJavaRow, this blog post presents a cleaner alternative.  It's cleaner because it's based on the schema, rather than some Java code that could suffer a syntax error.

No comments: