Tuesday, June 11, 2013

An Embedded Database Logger for Talend Open Studio Jobs


Talend Open Studio can produce a lot of logging information for debugging and the operational monitoring of jobs.  However, it can be difficult to get a summarized report by stringing together log entries.  Use the embedded Derby database to produce a single record of your job's run with a detailed accounting of states, statuses, times, and messages.

update:

This technique has a lot of potential, but there's a bug that caused me to go away from this on a recent consulting job: http://jira.talendforge.org/browse/TDI-18300.  There are workarounds, like shuffling the order of the tJavaDB* components or making each Derby call in a separate process subjob, but to hack up a TOS job in this way negates the benefits mentioned below.


I ended up keeping Derby in this particular consulting job because I needed to join two text files.  I loaded both text files into Derby tables using a "Create and drop" Action and followed this up with a SQL join.  This was a great use of tJavaDBInput and tJavaDBOutput.

Hopefully, I can return to this post with either a report that the bug has been fixed or a link to a new post with an alternative database.

/update:

Derby (also called JavaDB) is a pure Java database that can run in an embedded mode.  That means that there are no extra processes run with Derby.  Also, Derby has a small footprint.  Its 50Mb of files are located in a single folder.  Since the installation isn't complicated and doesn't involve network connections, Derby can be used almost anywhere such as on a file processing server.

Logging with a Single Record

This post presents a job that loads an Excel spreadsheet into a MySQL database.  This can typically be coded with 2 or three TOS components (tFileInputExcel, tMySqlOutput, tMap).  To get a detailed view of what's happening, several components are added in order to provide the following for the job.
  • Start time and end time; includes end time for failure cases
  • Processing status set when job starts and updated when job ends
  • A count of affected rows
Here is the schema used to record the job information.  Each run of a job will produce a single record.  This is the key benefit to using Derby; you don't have to scan a daily log file to find out what happened. 'ij' is a derby command-line client that allows you to issue SQL commands (SELECT, etc.) to the database.

Derby "Session" Showing Schema
The schema is DERBYTEST_LOG(JOB_NAME, PROCESSING_STATE, START_TIME, END_TIME, ROWS_AFFECTED, ERROR_MSG).

Rolling Log

Using a single log record reduces the clutter inherent in keeping a rolling log of events.  The rolling log continually inserts new records as processing progresses.  This is the output for a job that uses a tStatCatcher to record times during processing steps.

Capturing Job Information in a Rolling Log

A Job Using Derby

The following screenshot shows a job that inserts a DERBYTEST_LOG record into the database and repeatedly updates the record based on different conditions.  Note the use of the tFlowMeter and tLogCatcher exception handlers.  These are run at the end of the process with tLogCatcher handling error conditions.

A Job Logging Program State to a DERBYTEST_LOG Record
 There is a record upsert and update surrounding the main processing of the job.  Two tRowGenerator components update the status and use a TalendDate.getCurrentDate() call to set the DERBYTEST_LOG record.


Successful Run Information Recorded

A tFlowMeter component is inserted between the tMap and the tMySqlOutput components.  This is recording how many records have come across that flow.  The results are communicated to a tFlowMeterCatcher component which also updates the DERBYTEST_LOG.  A tMap is used because the fields in the tLogCatcher schema are named differently than those in DERBYTEST_LOG.

A tLogCatcher component will catch any die or Java exception events.  This will update the DERBYTEST_LOG status to "FAILED" and set the end time of the failure.  An error message is also written out.

Failed Run with Error Message Recorded

Trade Off

Embedded Derby is not a database server, so it doesn't support concurrent connections.  The ij client program used to run the SQL SELECT statements in the above screen shots could not be left running when the TOS job started.  Similarly, in order to run a query on the DERBYTEST_LOG table, the TOS job should not be running.

If your Talend Open Studio job needs to keep track of its program state or be made aware of other jobs that may be running, a small embedded database can help.  A database like Derby lets you perform updates and queries throughout your TOS job that can help reduce the amount of custom Java code and components in your job.  Derby (also called JavaDB) doesn't add any processes or network connections to your operational environment.  It deploys to 50Mb of files in a single folder.

No comments: