If you need to process URLs with Talend Open Studio, a few well-placed components can break apart the URL parameters to be stored, converted, or filtered.
Given the following URL
/google.se/url?sa=t&rct=j&q=insights%20konsult&source=web&cd=11&ved=0CC4QFjAAOAo&url=http%3A%2F%2Fwww.inuseinsights.se%2Fom-inuse-insights%2Fpartners
You can break the string apart with 3 Talend Open Studio components that will result in a stream of name/value pairs. This screenshot shows the running of such a job.
 |
Name / Value Pairs Extracted from a URL |
Depending on requirements, additional columns can be carried through the processing to provide a business key (such as host or path). This screenshot shows the job.
 |
Job Parsing a URL - Two Extra Delimited Fields and a tNormalize |
Three components hack off various pieces of the URL. First, a tExtractDelimitedFields_1 separates the host/path from the QUERY_STRING using the "?" delimiter. Next, a tNormalize takes each name/value pair, forming a distinct row based on the "&" delimiter. Finally, the second tExtractDelimitedFields_2 separates the name from the value, based on "=".
The tFilterColumns component is used for presentation purposes, it removes the pre-processed "path" variable.
Here are the component configurations starting with the tFixedFlow component providing the test data.
 |
A tFixedFlowInput with a URL |
|
|
|
 |
tExtractDelimitedFields_1 |
 |
tNormalize | |
 |
tExtractDelimitedFields_2 |
While some custom Java can be thrown into a tJavaRow, this blog post presents a cleaner alternative. It's cleaner because it's based on the schema, rather than some Java code that could suffer a syntax error.
No comments:
Post a Comment