Talend Open Studio can produce a lot of logging information for debugging and the operational monitoring of jobs. However, it can be difficult to get a summarized report by stringing together log entries. Use the embedded Derby database to produce a single record of your job's run with a detailed accounting of states, statuses, times, and messages.
update:
This technique has a lot of potential, but there's a bug that caused me to go away from this on a recent consulting job: http://jira.talendforge.org/browse/TDI-18300. There are workarounds, like shuffling the order of the tJavaDB* components or making each Derby call in a separate process subjob, but to hack up a TOS job in this way negates the benefits mentioned below.
I ended up keeping Derby in this particular consulting job because I needed to join two text files. I loaded both text files into Derby tables using a "Create and drop" Action and followed this up with a SQL join. This was a great use of tJavaDBInput and tJavaDBOutput.
Hopefully, I can return to this post with either a report that the bug has been fixed or a link to a new post with an alternative database.
/update:
Derby (also called JavaDB) is a pure Java database that can run in an embedded mode. That means that there are no extra processes run with Derby. Also, Derby has a small footprint. Its 50Mb of files are located in a single folder. Since the installation isn't complicated and doesn't involve network connections, Derby can be used almost anywhere such as on a file processing server.
Logging with a Single Record
This post presents a job that loads an Excel spreadsheet into a MySQL database. This can typically be coded with 2 or three TOS components (tFileInputExcel, tMySqlOutput, tMap). To get a detailed view of what's happening, several components are added in order to provide the following for the job.
The schema is DERBYTEST_LOG(JOB_NAME, PROCESSING_STATE, START_TIME, END_TIME, ROWS_AFFECTED, ERROR_MSG).
Rolling Log
Using a single log record reduces the clutter inherent in keeping a rolling log of events. The rolling log continually inserts new records as processing progresses. This is the output for a job that uses a tStatCatcher to record times during processing steps.
A Job Using Derby
The following screenshot shows a job that inserts a DERBYTEST_LOG record into the database and repeatedly updates the record based on different conditions. Note the use of the tFlowMeter and tLogCatcher exception handlers. These are run at the end of the process with tLogCatcher handling error conditions.
There is a record upsert and update surrounding the main processing of the job. Two tRowGenerator components update the status and use a TalendDate.getCurrentDate() call to set the DERBYTEST_LOG record.
A tFlowMeter component is inserted between the tMap and the tMySqlOutput components. This is recording how many records have come across that flow. The results are communicated to a tFlowMeterCatcher component which also updates the DERBYTEST_LOG. A tMap is used because the fields in the tLogCatcher schema are named differently than those in DERBYTEST_LOG.
A tLogCatcher component will catch any die or Java exception events. This will update the DERBYTEST_LOG status to "FAILED" and set the end time of the failure. An error message is also written out.
Trade Off
Embedded Derby is not a database server, so it doesn't support concurrent connections. The ij client program used to run the SQL SELECT statements in the above screen shots could not be left running when the TOS job started. Similarly, in order to run a query on the DERBYTEST_LOG table, the TOS job should not be running.
If your Talend Open Studio job needs to keep track of its program state or be made aware of other jobs that may be running, a small embedded database can help. A database like Derby lets you perform updates and queries throughout your TOS job that can help reduce the amount of custom Java code and components in your job. Derby (also called JavaDB) doesn't add any processes or network connections to your operational environment. It deploys to 50Mb of files in a single folder.
update:
This technique has a lot of potential, but there's a bug that caused me to go away from this on a recent consulting job: http://jira.talendforge.org/browse/TDI-18300. There are workarounds, like shuffling the order of the tJavaDB* components or making each Derby call in a separate process subjob, but to hack up a TOS job in this way negates the benefits mentioned below.
I ended up keeping Derby in this particular consulting job because I needed to join two text files. I loaded both text files into Derby tables using a "Create and drop" Action and followed this up with a SQL join. This was a great use of tJavaDBInput and tJavaDBOutput.
Hopefully, I can return to this post with either a report that the bug has been fixed or a link to a new post with an alternative database.
/update:
Derby (also called JavaDB) is a pure Java database that can run in an embedded mode. That means that there are no extra processes run with Derby. Also, Derby has a small footprint. Its 50Mb of files are located in a single folder. Since the installation isn't complicated and doesn't involve network connections, Derby can be used almost anywhere such as on a file processing server.
Logging with a Single Record
This post presents a job that loads an Excel spreadsheet into a MySQL database. This can typically be coded with 2 or three TOS components (tFileInputExcel, tMySqlOutput, tMap). To get a detailed view of what's happening, several components are added in order to provide the following for the job.
- Start time and end time; includes end time for failure cases
- Processing status set when job starts and updated when job ends
- A count of affected rows
Derby "Session" Showing Schema |
Rolling Log
Using a single log record reduces the clutter inherent in keeping a rolling log of events. The rolling log continually inserts new records as processing progresses. This is the output for a job that uses a tStatCatcher to record times during processing steps.
Capturing Job Information in a Rolling Log |
A Job Using Derby
The following screenshot shows a job that inserts a DERBYTEST_LOG record into the database and repeatedly updates the record based on different conditions. Note the use of the tFlowMeter and tLogCatcher exception handlers. These are run at the end of the process with tLogCatcher handling error conditions.
A Job Logging Program State to a DERBYTEST_LOG Record |
Successful Run Information Recorded |
A tFlowMeter component is inserted between the tMap and the tMySqlOutput components. This is recording how many records have come across that flow. The results are communicated to a tFlowMeterCatcher component which also updates the DERBYTEST_LOG. A tMap is used because the fields in the tLogCatcher schema are named differently than those in DERBYTEST_LOG.
A tLogCatcher component will catch any die or Java exception events. This will update the DERBYTEST_LOG status to "FAILED" and set the end time of the failure. An error message is also written out.
Failed Run with Error Message Recorded |
Trade Off
Embedded Derby is not a database server, so it doesn't support concurrent connections. The ij client program used to run the SQL SELECT statements in the above screen shots could not be left running when the TOS job started. Similarly, in order to run a query on the DERBYTEST_LOG table, the TOS job should not be running.
If your Talend Open Studio job needs to keep track of its program state or be made aware of other jobs that may be running, a small embedded database can help. A database like Derby lets you perform updates and queries throughout your TOS job that can help reduce the amount of custom Java code and components in your job. Derby (also called JavaDB) doesn't add any processes or network connections to your operational environment. It deploys to 50Mb of files in a single folder.
No comments:
Post a Comment