Tuesday, June 11, 2013

Getting Alfresco Contents with Talend Open Studio


Make a RESTful web service call with Talend Open Studio to get a list of the contents of an Alfresco Space.  Pass the response along to tExtractXMLField to break out the results for further processing.

Alfresco is built on a rich set of web services, which are available to developers in the form of a RESTful API.  Talend Open Studio can make RESTful web service calls using the tREST component, which saves an HTTP response in a schema field as a chunk of text.  tExtractXMLField can parse this chunk of text (an XML document) into something usable.

Alfresco Spaces

An Alfresco Space is a hierarchical container like a folder or a directory.  RESTful web service calls can access an Alfresco Space by forming a URL from the Space's position in the hierarchy.  For example, the following screenshot shows a Space CLIENT1 that is in the Clients Space, which in turn is in the top-level Company Home Space.


CLIENT1 contains a child Space, Input, and a content file TestHTML.

Alfresco Space CLIENT1 Contents

To retrieve the contents of CLIENT1 programmatically, make a RESTful call using Talend Open Studio's tREST component.  You can use tHttpRequest if the response is small, but a bug currently prevents a large response from being downloaded fully.  tAlfrescoOutput, despite its name, doesn't seem to work either.  tAlfrescoOutput's source builds a hardcoded RESTful path that doesn't seem to be supported in the current version of Alfresco.
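
As a rough illustration of how a Space's position in the hierarchy can become a URL, here is a minimal Java sketch.  The host, the "/alfresco/service/cmis/p" base path, and the "/children" suffix are assumptions, not values taken from this job; check them against your Alfresco version.  It is also an assumption that the path starts below Company Home.  The class and method names are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class SpaceUrl {
    /** Percent-encodes one Space name so it can be used as a single URL path segment. */
    static String encodeSegment(String name) {
        try {
            // URLEncoder produces form encoding, so turn '+' into "%20" for use in a path.
            return URLEncoder.encode(name, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    /** Joins the Space names under an assumed base URL and appends an assumed "/children" suffix. */
    static String childrenUrl(String base, String... spaces) {
        StringBuilder url = new StringBuilder(base);
        for (String space : spaces) {
            url.append('/').append(encodeSegment(space));
        }
        return url.append("/children").toString();
    }

    public static void main(String[] args) {
        // Hypothetical host and base path; Clients and CLIENT1 are the Spaces from this post.
        String base = "http://localhost:8080/alfresco/service/cmis/p";
        System.out.println(childrenUrl(base, "Clients", "CLIENT1"));
    }
}
```

In the Talend job itself the same idea is just a String expression in the tREST URL field; the helper only makes the encoding of names with spaces explicit.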

Talend Job

This job is based on tREST which retrieves an XML document returned by an Alfresco call.  The tREST is routed to a tExtractXMLField to break the document up into individual fields.  The fields are directed to a tLogRow.  A tLoadLibrary is used to introduce a routine that will Base64-encode the username and password.


tREST Job Calling Alfresco

tREST

The following screenshot shows the configuration for the tREST component.  A path is built using the hierarchical Spaces.  This installation of Alfresco protects these Spaces using Basic Authentication.  An HTTP header is used to pass along Authorization credentials.  A Base64-encoded String of the form "username:password" (note the colon) is the argument.  The String encoding is performed by Commons Codec.
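
To make the header format concrete, here is a small Java sketch of building a Basic Authentication header value.  It uses java.util.Base64 from the JDK (Java 8 or later) as a stand-in for Commons Codec so that it runs without extra JARs; the class name, method name, and credentials are placeholders, not what the job in the screenshot uses.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class BasicAuthHeader {
    /** Returns the value to send in the "Authorization" header for HTTP Basic Authentication. */
    static String value(String username, String password) {
        // Username and password are joined by a single colon with no surrounding spaces.
        String credentials = username + ":" + password;
        String encoded = Base64.getEncoder()
                .encodeToString(credentials.getBytes(StandardCharsets.UTF_8));
        // The scheme name "Basic" and one space precede the encoded credentials.
        return "Basic " + encoded;
    }

    public static void main(String[] args) {
        // "admin" / "secret" are placeholder credentials for illustration only.
        System.out.println("Authorization: " + value("admin", "secret"));
        // prints: Authorization: Basic YWRtaW46c2VjcmV0
    }
}
```

Keep in mind that Base64 is an encoding, not encryption, so this header should only travel over a connection you trust.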


tREST Configuration

XML Processing

tREST saves its results as a field "Body" in a schema (in memory).  This Body field can be directed to a tExtractXMLField where XPaths will break the XML document Body into individual fields.  Alfresco uses namespaces, and these are critical to the successful operation of the tExtractXMLField.  To study the namespaces, I hit the URL in the tREST in the browser and brought it into LiquidXML Studio for analysis.
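
Outside of Talend, the same namespace issue can be seen with the JDK's own XPath API.  The sketch below is an illustration only: the sample XML is hand-written and simplified, not real Alfresco output, and the element layout, the "atom" prefix, and the class name are assumptions.  Only the "cmis" prefix and its namespace URI come from the response discussed in this post.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class NamespacedXPathSketch {
    // Hand-written, simplified sample; NOT real Alfresco output.
    static final String SAMPLE =
        "<feed xmlns='http://www.w3.org/2005/Atom'"
      + " xmlns:cmis='http://docs.oasis-open.org/ns/cmis/core/200908/'>"
      + "<entry><title>Input</title><cmis:properties>"
      + "<cmis:propertyId propertyDefinitionId='cmis:objectTypeId'>"
      + "<cmis:value>cmis:folder</cmis:value></cmis:propertyId></cmis:properties></entry>"
      + "<entry><title>TestHTML</title><cmis:properties>"
      + "<cmis:propertyId propertyDefinitionId='cmis:objectTypeId'>"
      + "<cmis:value>cmis:document</cmis:value></cmis:propertyId></cmis:properties></entry>"
      + "</feed>";

    /** Maps the prefixes used in the XPath expressions to namespace URIs. */
    static NamespaceContext context() {
        final Map<String, String> uris = new HashMap<String, String>();
        uris.put("atom", "http://www.w3.org/2005/Atom");
        uris.put("cmis", "http://docs.oasis-open.org/ns/cmis/core/200908/");
        return new NamespaceContext() {
            public String getNamespaceURI(String prefix) {
                String uri = uris.get(prefix);
                return uri != null ? uri : XMLConstants.NULL_NS_URI;
            }
            // Reverse lookups are not needed for XPath evaluation in this sketch.
            public String getPrefix(String uri) { return null; }
            public Iterator<String> getPrefixes(String uri) {
                return Collections.<String>emptyIterator();
            }
        };
    }

    /** Returns one "title|objectTypeId" row per repeating entry element. */
    static List<String> extract(String xml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true); // without this, the prefixed steps match nothing
            Document doc = dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(context());

            // The loop expression selects the repeating element; the others are relative to it.
            NodeList entries = (NodeList) xpath.evaluate(
                "/atom:feed/atom:entry", doc, XPathConstants.NODESET);
            List<String> rows = new ArrayList<String>();
            for (int i = 0; i < entries.getLength(); i++) {
                String title = xpath.evaluate("atom:title", entries.item(i));
                String type = xpath.evaluate(
                    "cmis:properties/cmis:propertyId"
                  + "[@propertyDefinitionId='cmis:objectTypeId']/cmis:value",
                    entries.item(i));
                rows.add(title + "|" + type);
            }
            return rows;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse or query the XML", e);
        }
    }

    public static void main(String[] args) {
        for (String row : extract(SAMPLE)) {
            System.out.println(row);
        }
    }
}
```

The split between one loop expression and several relative expressions mirrors how an XPath-based extractor is typically set up: one query picks the repeating element and the remaining queries run against each match.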

Breaking Apart XML Response from Alfresco

One element is a repeating structure in the Alfresco response.  There is a straightforward element based on the same target namespace.  The objectTypeId field is based on several elements that belong to the namespace "http://docs.oasis-open.org/ns/cmis/core/200908/" which is referenced by prefix "cmis".  In some XML components you can use your own prefix, but not in tExtractXMLField.

LiquidXML Studio has a tool for forming XPaths by selecting one of the desired elements to be returned.

Libraries

In order to form an Authorization HTTP header, the username and password need to be Base64-encoded.  I'm doing this with a JAR file called Commons Codec (http://commons.apache.org/codec/).  These two screenshots show the reference to the JAR file.  I'm also doing a static import to shorten the amount of typing needed in my tREST HTTP header section.

Loading a JAR File

Static Import in tLoadLibrary

If you use the tHttpRequest component when the bug is fixed, there's a more convenient way to handle authentication.

Results

The following screenshot shows the results.  From the RESTful API, the two children of CLIENT1 are returned.  One child is the folder "Input".  The other is the content HTML document "TestHTML".

Talend Open Studio can access Alfresco functionality using the tREST component.  Since the response is XML, you can use Talend's tExtractXMLField to break out the fields.  When a bug is fixed with tHttpRequest, another job design is available which will conveniently save the results to a file to be handled by Talend's file-based processing.
