Thursday, June 13, 2013

Large Chunks of Text in Talend Open Studio


You can save large blocks of text in Talend Open Studio using database types like Clob (Oracle) or Memo (Microsoft Access).  However, if the input is a text file, you may need additional processing to gather the Clob data.

When storing text, there is a limit in text fields.  In Oracle, there is a 4k limit on storing text in a VARCHAR2 field.  The limit in a Microsoft Access text field is 255 characters.  Both databases have an alternate storage system for keeping larger amounts of text.  Oracle has the Clob and Access has the Memo.  Selecting Clob or Memo over VARCHAR2 or Text may affect the type of operations you can do on the field like sorting or comparisons.

This distinction between the storage types isn't important to Talend Open Studio.  Retrieve the metadata from the database table which will create a field of type String or Object.  From a coding perspective, these are treated identically since a Java String is an Object.

One difference in dealing with large amounts of text occurs on the input side.  If the text is coming from a large file or web service, the components used in Talend for input may have a problem if the text contains linebreaks.  Components like tFileInputDelimited or tFileInputFullRow will not necessarily map a whole file to a Clob field.

In this job, a file's contents are gathered in a global variable called "stringBuffer" which is a java.lang.StringBuffer. The StringBuffer is written out to a database's Clob field (in this example an Access "Memo").  The writing is assisted with a tRowGenerator.

Reading a Large File into a Database Clob
The tSetGlobalVar defines the "stringBuffer" global variable used throughout the job.

tSetGlobalVar Defines a StringBuffer
The tFileInputFullRow is configured to read a file in row-by-row.  The resulting flow is sent to a tJava that gathers each line and appends them into the global StringBuffer.  Note that this approach has the side effect of removing linebreaks from the data.  If you need the linebreaks preserved, you'll need to add them in with a second StringBuffer call: stringBuffer.append("\n").

Unpack the StringBuffer and Append a Line
The tFileInputFullRow / tJavaRow subjob ends with the global variable "stringBuffer" containing the entire contents of the text file.  The successful completion triggers a second subjob that will load the contents into the database.

A tRowGenerator is used to produce a single row.  This is a flow adapter that will allow a tAccessOutput component to be used.  The tRowGenerator maps hardcoded recordName and a dynamic recordData fields.  The recordData field is formed by retrieving the stringBuffer global variable and calling the toString() method to produce a String for use in a tMap.

tRowGenerator Forming a Row Based on Global Var
Lastly, the tMap maps the tRowGenerator fields to the target database.  For Access, TOS uses a String to represent the Memo field.  In Oracle, TOS uses a Clob for the field.

Straight-forward tMap
The algorithm for this works with medium-sized files.  Gather file contents in a StringBuffer and output in a separate subjob.  This approach was tested with 200k files.  There was an OutOfMemory exception thrown when tried with a 100Mb file, however, the laptop on which the trail was run was fairly limited.

No comments: